[Break BC][RFC] utils directory refactor #1421

Closed
RdoubleA opened this issue Aug 27, 2024 · 3 comments

RdoubleA commented Aug 27, 2024

Collecting all the discussions around this here from many private channels. @ebsmothers has already written up an excellent quick RFC on this topic (#1414) but I wanted to formalize the discussion here.

Problem

torchtune/utils/ is a massive directory with a wide spectrum of helpers, from shard_model for distributed training to get_logger for retrieving Python's built-in logger. This is not great. It has served as an "I don't know where else to put this function" folder, and that is leading to tangible problems:

  • Hinders discoverability of code. A lot of important functionality for running our recipes lives under utils, which is not the first place users will look, and the name is not specific enough to tell them where to navigate.
  • Complicates our dependency graph, leading to many cycles that are created as we continue to inflate the directory. Utils depends on models and modules, but also everything else in the library depends on utils, so it's very easy to create cycles.

Ideally, the utils directory would remain a miscellaneous collection of helpers that any other directory in the library can depend on, but this isn't the case today: utils/_checkpointing and utils/_distributed.py depend on models/ and modules/.

Here is the current dependency graph (A -> B means B depends on A). Notice how easy it is to create a cycle.
[Image: current dependency graph]
If data took a dependency on utils, you would get a cycle. modules and models can never take a dependency on utils, or there's an immediate cycle. Nor can modules and models take a dependency on config or dataset, because that would also create a cycle. This is not scalable. It is currently blocking important features, like #1193, and adding unnecessary tech debt, like deprecate being placed in data instead of utils to avoid a cycle (see #1286).
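
The cycle reasoning above can be checked mechanically. This is a minimal sketch, not torchtune tooling: the edges in `DEPS` are assumptions reconstructed from the prose (the graph image is not reproduced here), and `would_create_cycle` is a hypothetical helper. Note that the dict maps each package to the packages it depends on, which is the reverse of the arrow direction used in the figure.

```python
# Hypothetical snapshot of the package graph: each key depends on its values.
# Only "utils depends on models and modules" is stated outright in the post;
# the remaining edges are inferred from the cycles it describes.
DEPS = {
    "utils": {"models", "modules", "data"},
    "models": {"modules"},
    "modules": set(),
    "data": set(),
    "config": {"utils"},
    "datasets": {"data", "utils"},  # the post's "dataset"
}


def would_create_cycle(deps, src, dst):
    """Return True if making `src` depend on `dst` would close a cycle.

    A new edge src -> dst closes a cycle exactly when `src` is already
    reachable from `dst` by following existing "depends on" edges.
    """
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False


print(would_create_cycle(DEPS, "data", "utils"))     # True
print(would_create_cycle(DEPS, "modules", "utils"))  # True
print(would_create_cycle(DEPS, "models", "config"))  # True
print(would_create_cycle(DEPS, "config", "utils"))   # False
```

Under these assumed edges, almost any new import pointing back at utils closes a loop, which is the scalability problem described above.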

Approach

We need a way to restructure our dependency graph. However, this would require a massive refactor and would undoubtedly break BC. So there are a couple of options.

  1. Rename utils/_checkpointing -> utils/checkpointing and remove checkpointing imports from utils/__init__.py. Import from utils/checkpointing directly. This removes the utils-models dependency and stops a common cause of cycles. This is the easiest approach, but it will still require updating all our configs, will break BC, and addresses neither the utils-modules dependency nor the core problem of utils being bloated.
  2. Move all utils that depend on other directories to a new folder: training. Keep utils as a directory that does not take a dependency on any other directory. This fundamentally restructures our dependency graph and will prevent further cycles.
    [Image: proposed dependency graph]
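
The invariant in option 2 (utils takes no dependency on any other directory) is the kind of rule that could be enforced by a small lint. The sketch below is one possible way to do that with the standard-library `ast` module. It is an illustration only: `forbidden_imports` is a hypothetical helper, not something that exists in torchtune, and it ignores relative imports for brevity.

```python
import ast


def forbidden_imports(source, package="torchtune", allowed=frozenset({"utils"})):
    """List absolute imports of `package` subpackages outside `allowed`.

    Hypothetical lint for files under utils/. Relative imports
    (``from . import x``) are skipped in this sketch.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and imported name so that
            # "from torchtune import models" is also caught.
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if parts[0] == package and len(parts) > 1 and parts[1] not in allowed:
                found.add(name)
    return sorted(found)


example = (
    "import torch\n"
    "from torchtune.utils import get_logger\n"
    "from torchtune.models import llama2\n"
    "from torchtune import modules\n"
)
print(forbidden_imports(example))
# ['torchtune.models.llama2', 'torchtune.modules']
```

Run over every file in utils/ as part of CI, a check like this would fail as soon as someone reintroduces a dependency on models or modules.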

Since both approaches are breaking BC, we may as well only break it a single time and fundamentally fix the problem, so I propose 2. We can debate the actual name of the new folder. Some options:

  • training
  • training_utils
  • framework (similar to torchtnt)
  • recipe_utils

This new directory will contain all utilities that are related to training and used in recipes. These are the new locations for the files in utils:

training

  • _checkpointing
    • constants.py
  • _device.py
  • _distributed.py
  • _profiler.py
  • activations.py
  • memory.py
  • metric_logging.py
  • pooling.py
  • precision.py
  • quantization.py
  • seed.py

generation

  • _generation.py

config

  • argparse.py

data

  • collate.py

utils

  • _version.py
  • logging.py
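
To make the move concrete, the listing above can be written down as a lookup table. This is a rough sketch under a few assumptions: "training" stands in for whichever folder name is finally chosen, `proposed_location` is a hypothetical helper, and constants.py is left out because the nesting in the list does not make its exact destination clear.

```python
from pathlib import PurePosixPath

# Entry under utils/ -> proposed destination directory, copied from the list.
# "training" is a placeholder; the final folder name is still up for debate.
NEW_DIR = {
    "_checkpointing": "training",
    "_device.py": "training",
    "_distributed.py": "training",
    "_profiler.py": "training",
    "activations.py": "training",
    "memory.py": "training",
    "metric_logging.py": "training",
    "pooling.py": "training",
    "precision.py": "training",
    "quantization.py": "training",
    "seed.py": "training",
    "_generation.py": "generation",
    "argparse.py": "config",
    "collate.py": "data",
    "_version.py": "utils",
    "logging.py": "utils",
}


def proposed_location(old_path):
    """Map a path under .../utils/ to its proposed new location."""
    parts = PurePosixPath(old_path).parts
    idx = parts.index("utils")
    new_dir = NEW_DIR[parts[idx + 1]]  # KeyError for entries not in the list
    return str(PurePosixPath(*parts[:idx], new_dir, *parts[idx + 1:]))


print(proposed_location("torchtune/utils/memory.py"))
# torchtune/training/memory.py
print(proposed_location("torchtune/utils/collate.py"))
# torchtune/data/collate.py
```

A table like this could also drive a scripted rewrite of import paths in recipes and configs, though that step is not part of the proposal itself.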

Distributed utilities could possibly be in their own folder since these are usually shared across training and inference recipes, and it may be odd to import from training in a generate/inference recipe.

@kartikayk @ebsmothers @felipemello1 @pbontrager

@pbontrager
Contributor

As a bonus, can this include replacing the generic logger, everywhere it's cropped up, with the utils logger?

@RdoubleA
Contributor Author

> As a bonus, can this include replacing the generic logger, everywhere it's cropped up, with the utils logger?

Sorry, what do you mean exactly? Do you have an example?

@pbontrager
Contributor

Here is one example. The utils version makes sure the logger is set up correctly for distributed runs.
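
The thread does not spell out what "set up correctly" involves. One common pattern, shown here purely as an illustration and not as torchtune's implementation, is to quiet every process except rank zero so that multi-GPU runs do not print each message once per worker. The helper name `get_rank_aware_logger` is hypothetical, and reading the rank from the `RANK` environment variable assumes a launcher such as torchrun has set it.

```python
import logging
import os


def get_rank_aware_logger(name, level=logging.INFO):
    """Return a logger that is verbose on rank 0 and errors-only elsewhere.

    Assumption: the launcher exports RANK; a missing variable is treated
    as a single-process run (rank 0).
    """
    logger = logging.getLogger(name)
    if not logger.handlers:
        # Attach a handler only once so repeated calls do not duplicate output.
        logger.addHandler(logging.StreamHandler())
    rank = int(os.environ.get("RANK", "0"))
    logger.setLevel(level if rank == 0 else logging.ERROR)
    return logger


log = get_rank_aware_logger("recipe")
log.info("Printed once per job rather than once per process.")
```

A bare `logging.getLogger(__name__)` has no such rank handling, which is presumably why swapping it for a shared helper was suggested.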
