Support Optimizer-in-the-backward #1737

Closed

Conversation


@mori360 mori360 commented Oct 1, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Enable Optimizer-in-the-backward for full_finetune_distributed

Changelog

  • Update full_finetune_distributed to enable Optimizer-in-the-backward
  • Update test_full_finetune_distributed with an _optimizer_in_bwd config
  • Update test_distributed to test running with/without optimizer-in-the-backward, and to verify behavior after saving and loading the state_dict.

Test plan

  • Test running with optimizer_in_the_backward: tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 optimizer_in_bwd=True
  • Test running optimizer_in_the_backward with resume_from_checkpoint: tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 epochs=10 optimizer_in_bwd=True resume_from_checkpoint=True checkpointer.recipe_checkpoint=/tmp/Llama-2-7b-hf/recipe_state.pt checkpointer.checkpoint_files=[hf_model_0001_1.pt,hf_model_0002_1.pt]
  • Verify that running with Optimizer-in-the-backward yields the same loss, model_state_dict, and optimizer_state_dict, and that the model still matches after saving and loading: pytest tests/torchtune/training/test_distributed.py -k test_optimizer_in_backward

Memory cost analysis:
With each layer's gradients costing 193MB of memory, the original (left) case reaches its peak memory at the 31st layer, after accumulating 193MB of gradients for each of the 30 preceding layers.
The right case, with optimizer-in-the-backward, frees this memory during the backward pass and reaches a lower peak.
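As a rough back-of-the-envelope check of the numbers above (gradient memory only; activations and optimizer states are not counted): without optimizer-in-the-backward, the accumulated gradients alone peak at about 193 MB × 30 ≈ 5.8 GB, whereas with optimizer-in-the-backward only on the order of a single layer's gradients (≈ 193 MB) needs to be alive at any point during backward.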
(figure: peak memory comparison, with vs. without optimizer-in-the-backward)

Training time and loss analysis:
(figure: training time and loss comparison)


pytorch-bot bot commented Oct 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1737

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit f639b6d with merge base f639b6d:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 1, 2024

codecov-commenter commented Oct 2, 2024

Codecov Report

Attention: Patch coverage is 1.97368% with 149 lines in your changes missing coverage. Please review.

Project coverage is 25.44%. Comparing base (7cf656b) to head (207b1b1).
Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
tests/torchtune/training/test_distributed.py 2.88% 101 Missing ⚠️
recipes/full_finetune_distributed.py 0.00% 48 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1737       +/-   ##
===========================================
- Coverage   69.33%   25.44%   -43.89%     
===========================================
  Files         305      305               
  Lines       15892    16089      +197     
===========================================
- Hits        11018     4094     -6924     
- Misses       4874    11995     +7121     

☔ View full report in Codecov by Sentry.

@mori360 mori360 marked this pull request as ready for review October 7, 2024 22:24
@mori360 mori360 marked this pull request as draft October 7, 2024 23:02
@mori360 mori360 marked this pull request as ready for review October 8, 2024 21:56
@weifengpy
Contributor

Could we draw loss curves in Weights & Biases to show that the numerics are the same with/without optimizer-in-the-backward?

@mori360
Author

mori360 commented Oct 10, 2024

Could we draw loss curves in Weights & Biases to show that the numerics are the same with/without optimizer-in-the-backward?

The loss curves have been added under the comments section in the third column of the table on the right-hand side.

@mori360 mori360 marked this pull request as ready for review October 10, 2024 00:59
recipes/full_finetune_distributed.py
optimizer,
opt_state_dict,
self._device,
if not optimizer_in_bwd:
Contributor

This is a nit, but let's switch the order of this if/else so that it's if optimizer_in_bwd. Then it's closer to what's in the single-device recipe and easier to compare between the two

Author

We have the initial zero-out in the single-device recipe at:

# zero out the gradients before starting training
if not self._optimizer_in_bwd:
    self._optimizer.zero_grad()
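
For comparison, here is a minimal plain-PyTorch sketch of the flipped `if optimizer_in_bwd` ordering the reviewer suggests (illustrative only; the function name and the use of AdamW are assumptions, not the recipe's actual code):

```python
import torch

def setup_optimizer(model: torch.nn.Module, lr: float, optimizer_in_bwd: bool):
    """Return either a dict of per-parameter optimizers or a single optimizer."""
    if optimizer_in_bwd:
        # One optimizer instance per parameter; .step()/.zero_grad() are driven
        # from hooks during backward rather than from the training loop.
        return {p: torch.optim.AdamW([p], lr=lr) for p in model.parameters()}
    # Regular path: a single optimizer over all parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    optimizer.zero_grad()  # zero out the gradients before starting training
    return optimizer
```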

recipes/full_finetune_distributed.py
Comment on lines 708 to 710
raise NotImplementedError(
    "Gradient clipping is not supported after optimizer-in-the-backward."
)
Contributor

Is this actually due to optimizer in backward, or something else? I don't think we have such a check in the single-device recipe

Author

It's due to optimizer-in-backward.
Optimizer-in-backward calls .step() and .zero_grad() during loss.backward(), so the gradients are already None by the time torch.nn.utils.clip_grad_norm_ is called, and gradient clipping cannot be applied.
The single-device recipe has the same issue if optimizer_in_bwd=True and clip_grad_norm is set.
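
A self-contained sketch of the conflict, using plain PyTorch post-accumulate-grad hooks (this mirrors the general optimizer-in-backward pattern, not the recipe's exact implementation; requires a torch version that provides register_post_accumulate_grad_hook):

```python
import torch

model = torch.nn.Linear(8, 8)
# One optimizer per parameter, stepped and zeroed from inside backward.
optim_dict = {p: torch.optim.AdamW([p], lr=1e-3) for p in model.parameters()}

def step_and_free(param: torch.Tensor) -> None:
    optim_dict[param].step()
    optim_dict[param].zero_grad()  # frees the gradient (sets it to None) inside backward

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_and_free)

loss = model(torch.randn(4, 8)).sum()
loss.backward()

# All gradients were consumed and freed during backward, so there is nothing
# left for torch.nn.utils.clip_grad_norm_ to clip.
assert all(p.grad is None for p in model.parameters())
```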

@@ -722,6 +775,27 @@ def train(self) -> None:

self._profiler.stop()

def get_lr_scheduler(self) -> float:
Contributor

If we are gonna have this as a utility I would take it out of this recipe since it's equally applicable to the single-device case. Also I wouldn't call it get_lr_scheduler, since we're really getting the current lr, not the scheduler itself.

Author

Single-device gets the LR under the assumption that all optimizers have the same LR.

Shall we apply the logic here to the single-device recipe as well, rename it to get_lr, and move it to /torchtune/utils?
Or should we just take the same assumption as single-device?

Contributor

Technically I think this was also previously the assumption in this recipe, right? Since we just log the LR from the first param group. So we should be able to maintain the same logic for both
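
A minimal sketch of what such a shared get_lr utility could look like (hypothetical name and placement; it reads the LR from the first param group and, for the optimizer-in-backward case, checks that all per-parameter optimizers agree):

```python
from typing import Dict, Union
import torch

def get_lr(optimizer: Union[torch.optim.Optimizer, Dict[str, torch.optim.Optimizer]]) -> float:
    """Return the current learning rate, assuming a single LR across param groups."""
    if isinstance(optimizer, dict):
        # optimizer-in-backward: one optimizer per parameter
        lrs = {opt.param_groups[0]["lr"] for opt in optimizer.values()}
        if len(lrs) > 1:
            raise RuntimeError(f"Expected a single learning rate, found {lrs}")
        return lrs.pop()
    # regular optimizer: log the LR from the first param group
    return optimizer.param_groups[0]["lr"]
```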

version.parse(torch.__version__).base_version < "2.5.0",
reason="torch >= 2.5 required",
)
def test_optimizer_in_backward(self):
Contributor

I'm wondering why we create a whole new test in here rather than adding a distributed test case in the existing TestOptimInBackward test case (or if we want to inherit from FSDPTest, a new distributed version of that class in the same file). Because testing optimizer-in-backward as part of a class that is otherwise meant to test our fully_shard + state dict save and load logic feels a bit strange to me.

Author

The test here wants to prove that models running with optimizer-in-backward can reach the same performance as models running without it.
The state_dict saving and loading parts want to test optimizer-in-backward's wrapper, which is a bit different from a traditional optimizer.
Neither is covered in TestOptimInBackward.
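
For illustration, a small sketch of the kind of save/load round-trip the state_dict part exercises, written against plain per-parameter optimizers rather than torchtune's actual wrapper (names are assumptions):

```python
import torch

model = torch.nn.Linear(8, 8)
optim_dict = {
    name: torch.optim.AdamW([param], lr=1e-3)
    for name, param in model.named_parameters()
}

# Saving: unlike a single optimizer, there is one state_dict per parameter,
# so they are collected into a dict keyed by parameter name.
saved = {name: opt.state_dict() for name, opt in optim_dict.items()}

# Loading: rebuild the per-parameter optimizers and restore each one.
restored = {
    name: torch.optim.AdamW([param], lr=1e-3)
    for name, param in model.named_parameters()
}
for name, opt in restored.items():
    opt.load_state_dict(saved[name])
```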

Contributor

If we want to do such an end-to-end test, maybe we can add a recipe test case instead? See e.g. test_full_finetune_distributed.py: with such a test case you can run the full end-to-end recipe on a small test model and set optimizer_in_bwd=True directly from the config.
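
For illustration, a rough sketch of the shape such a recipe-level test could take (this is not torchtune's actual recipe-test harness, which runs a small test model with config overrides; the command here simply mirrors the CLI invocation from the test plan above):

```python
import subprocess

def test_full_finetune_distributed_optimizer_in_bwd():
    """Smoke-test the distributed recipe with optimizer_in_bwd=True via the tune CLI."""
    cmd = [
        "tune", "run", "--nproc_per_node", "2", "full_finetune_distributed",
        "--config", "llama2/7B_full",
        "fsdp_cpu_offload=False",
        "max_steps_per_epoch=2",
        "optimizer_in_bwd=True",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
```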
