
[Break BC] Create training directory, move checkpointing #1432

Merged 5 commits into pytorch:main on Aug 30, 2024

Conversation

RdoubleA
Contributor

Context

The big kahuna of refactors.

Motivation is discussed extensively in #1421. Here, we only move the checkpointing directory into training. This should break most import cycles, since checkpointing depends on models. Unfortunately, this touches all configs... I'll run a few to make sure none break.

All references to torchtune.utils.FullModelXXXCheckpointer now become torchtune.training.FullModelXXXCheckpointer.
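Since the rename touches every config, a migration could be scripted. The helper below is a hedged sketch, not part of this PR; the old/new module paths mirror the rename described above, but the function itself is hypothetical:

```python
import re

# Hypothetical migration helper (not shipped in this PR): rewrites old
# torchtune.utils checkpointer references in a config's text to the new
# torchtune.training location.
OLD_PATH = r"torchtune\.utils\.(FullModel\w*Checkpointer)"

def migrate_config(text: str) -> str:
    """Replace torchtune.utils checkpointer paths with torchtune.training."""
    return re.sub(OLD_PATH, r"torchtune.training.\1", text)

cfg = """\
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
"""
print(migrate_config(cfg))
# _component_ now points at torchtune.training.FullModelHFCheckpointer
```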

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:


Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models

  • I did not change any public API;
  • I have added an example to docs or docstrings;


pytorch-bot bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1432

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9dfe15f with merge base 929a45a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 28, 2024
@codecov-commenter

Codecov Report

Attention: Patch coverage is 14.61538% with 111 lines in your changes missing coverage. Please review.

Project coverage is 72.24%. Comparing base (badca1c) to head (4ce3be1).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
torchtune/training/checkpointing/_checkpointer.py 31.57% 26 Missing ⚠️
recipes/ppo_full_finetune_single_device.py 0.00% 13 Missing ⚠️
recipes/lora_finetune_distributed.py 0.00% 11 Missing ⚠️
recipes/lora_finetune_single_device.py 0.00% 11 Missing ⚠️
recipes/full_finetune_single_device.py 0.00% 10 Missing ⚠️
recipes/lora_dpo_distributed.py 0.00% 10 Missing ⚠️
recipes/lora_dpo_single_device.py 0.00% 10 Missing ⚠️
recipes/full_finetune_distributed.py 0.00% 8 Missing ⚠️
recipes/qat_distributed.py 0.00% 8 Missing ⚠️
recipes/eleuther_eval.py 0.00% 2 Missing ⚠️
... and 2 more
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1432       +/-   ##
===========================================
+ Coverage   27.00%   72.24%   +45.23%     
===========================================
  Files         268      269        +1     
  Lines       12923    12930        +7     
===========================================
+ Hits         3490     9341     +5851     
+ Misses       9433     3589     -5844     



@ebsmothers ebsmothers left a comment


A few small comments, but modulo resolving merge conflicts and (still) green CI this looks good to me

Comment on lines 26 to 43
__all__ = [
"FullModelHFCheckpointer",
"FullModelMetaCheckpointer",
"FullModelTorchTuneCheckpointer",
"ModelType",
"Checkpointer",
"update_state_dict_for_classifier",
"ADAPTER_CONFIG",
"ADAPTER_KEY",
"EPOCHS_KEY",
"MAX_STEPS_KEY",
"MODEL_KEY",
"OPT_KEY",
"RNG_KEY",
"SEED_KEY",
"STEPS_KEY",
"TOTAL_EPOCHS_KEY",
]
Contributor:

This is different than what we do now, right? Rn we don't include checkpointer APIs in the parent's __all__. Not saying it's wrong to do it this way but just wanna understand the rationale for the change

Contributor Author:

mainly to keep it consistent, but not opposed to removing these
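For context on what listing these names in the parent's `__all__` buys: it makes them part of the package's star-import surface. A minimal sketch, with a hypothetical in-memory module standing in for the real package (torchtune not required):

```python
import sys
import types

# Hypothetical demo module standing in for the training package: one
# public checkpointer class plus a private helper, with __all__ listing
# only the public name.
pkg = types.ModuleType("demo_training")
pkg.FullModelHFCheckpointer = type("FullModelHFCheckpointer", (), {})
pkg._private_helper = object()
pkg.__all__ = ["FullModelHFCheckpointer"]
sys.modules["demo_training"] = pkg

# `from pkg import *` pulls in exactly the names listed in __all__.
ns = {}
exec("from demo_training import *", ns)
print("FullModelHFCheckpointer" in ns, "_private_helper" in ns)  # True False
```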

@@ -5,12 +5,12 @@
 # LICENSE file in the root directory of this source tree.
 from typing import Union

-from torchtune.utils._checkpointing._checkpointer import (
+from torchtune.training.checkpointing._checkpointer import (
Contributor:

Where'd my underscore go

@@ -381,7 +381,7 @@ looks something like this:
 checkpointer:

   # checkpointer to use
-  _component_: torchtune.utils.FullModelHFCheckpointer
Contributor:

L430 too

Contributor Author:

damn good eye

@@ -13,13 +13,13 @@

 import torch
 from safetensors.torch import save_file
-from torchtune import utils
+from torchtune import training
Contributor:

Not a huge deal (given the scope of other stuff you're doing here) but it's a little weird to me that we use training.MODEL_KEY etc. when all these things are literally defined in the local directory. Feels like needless indirection to me

Contributor Author:

eh, later problem

@RdoubleA RdoubleA merged commit 5e2046f into pytorch:main Aug 30, 2024
20 checks passed