[Break BC][RFC] utils directory refactor #1421

RdoubleA · 2024-08-27T23:33:14Z

Collecting all the discussions around this here from many private channels. @ebsmothers has already written up an excellent quick RFC on this topic (#1414) but I wanted to formalize the discussion here.

Problem

torchtune/utils/ is a massive directory with a wide spectrum of helpers, from shard_model for distributed training to get_logger, which retrieves Python's built-in logger. It has served as an "I don't know where else to put this function" folder, and this is leading to tangible problems:

- It hinders discoverability. A lot of important functionality for running our recipes lives under utils, which is not the first place users will look, and the name is not specific enough to tell them where to navigate.
- It complicates our dependency graph and creates more cycles as we continue to inflate the directory. utils depends on models and modules, but everything else in the library depends on utils, so it's very easy to create cycles.

Ideally, the utils directory remains a miscellaneous collection of helpers that any other directory in the library can take a dependency on, but this isn't the case today, with utils/_checkpointing and utils/_distributed.py depending on models/ and modules/.

Here is the current dependency graph (A -> B means B depends on A). Notice how easy it is to create a cycle.

If data took a dependency on utils, you would get a cycle. modules and models can never take a dependency on utils, because that would create an immediate cycle. modules and models also cannot take a dependency on config or dataset, because that would create a cycle through utils. This is not scalable: it is currently blocking important features, like #1193, and adding unnecessary tech debt, like the deprecate helper being placed in data instead of utils to avoid a cycle (see #1286).
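To make the cycle problem concrete, here is a rough sketch of a cycle check over a package-level dependency map. The edges below are only the ones stated in the text above (utils depends on models and modules; config and datasets depend on utils), not the full graph, and `find_cycle` is a hypothetical helper, not something that exists in torchtune.

```python
from typing import Dict, List, Optional, Set


def find_cycle(depends_on: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of package names, or None.

    depends_on maps a package to the packages it imports from.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, done
    nodes = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    color = {n: WHITE for n in nodes}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in sorted(depends_on.get(node, ())):
            if color[dep] == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for n in sorted(nodes):
        if color[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None


# Partial graph, assumed from the description above (not the real, full graph).
graph = {
    "utils": {"models", "modules"},
    "config": {"utils"},
    "datasets": {"utils"},
}
no_cycle = find_cycle(graph)

# Letting models import from config closes the loop: config -> utils -> models -> config.
cycle = find_cycle({**graph, "models": {"config"}})
```

With these assumed edges, the first call finds nothing, and the second reports a loop through config, utils, and models, which is the situation described for models taking a dependency on config.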

Approach

We need a way to restructure our dependency graph. However, this would require a massive refactor and would undoubtedly break BC. There are a couple of options:

1. Rename utils/_checkpointing -> utils/checkpointing and remove checkpointing imports from utils/__init__.py. Import from utils/checkpointing directly. This removes the utils-models dependency and stops a common cause of cycles. This is the easiest approach, but it will still require updating all our configs and will break BC, and it addresses neither the utils-modules dependency nor the core problem of utils being bloated.
2. Move all utils that depend on other directories to a new folder: training. Keep utils as a directory that does not take a dependency on any other directory. This fundamentally restructures our dependency graph and will prevent further cycles.

Since both approaches break BC, we may as well break it only once and fundamentally fix the problem, so I propose option 2. We can debate the actual name of the new folder. Some options:

training
training_utils
framework (similar to torchtnt)
recipe_utils

This new directory will contain all utilities that are related to training and used in recipes. These are the new locations for the files currently in utils:

training

- _checkpointing
  - constants.py
- _device.py
- _distributed.py
- _profiler.py
- activations.py
- memory.py
- metric_logging.py
- pooling.py
- precision.py
- quantization.py
- seed.py

generation

- _generation.py

config

- argparse.py

data

- collate.py

utils

- _version.py
- logging.py
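Once utils is reduced to files like these, the "utils takes no dependency on any other directory" rule could be checked mechanically. Below is a minimal sketch using Python's ast module. The `forbidden_imports` helper and the `FORBIDDEN` list are hypothetical names for illustration, the sample source is made up, and relative imports are deliberately ignored to keep it short.

```python
import ast
from typing import List, Sequence

# Hypothetical list of sibling directories that utils should not import from.
FORBIDDEN = ("models", "modules", "training", "generation", "config", "data", "datasets")


def forbidden_imports(
    source: str, root: str = "torchtune", forbidden: Sequence[str] = FORBIDDEN
) -> List[str]:
    """Return dotted names imported in source that live under root.<forbidden>."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a.b.c
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # from a.b import c  ->  treat as a.b.c
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # not an import, or a relative import (not handled here)
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == root and parts[1] in forbidden:
                found.append(name)
    return found


# Made-up file contents, used only to exercise the check.
sample = (
    "import logging\n"
    "from torchtune.models import llama2\n"
    "import torchtune.modules.peft\n"
    "from torchtune import config\n"
    "from torchtune.utils import logging as tt_logging\n"
)
violations = sorted(forbidden_imports(sample))
```

A check like this could run over every file under utils in CI and fail if the returned list is non-empty. Whether that is the right enforcement mechanism is a separate discussion.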

Distributed utilities could possibly be in their own folder since these are usually shared across training and inference recipes, and it may be odd to import from training in a generate/inference recipe.

@kartikayk @ebsmothers @felipemello1 @pbontrager


pbontrager · 2024-08-28T17:13:17Z

As a bonus, can this include replacing the generic logger, everywhere it has cropped up, with the utils.logger?

RdoubleA · 2024-08-28T17:15:06Z

As a bonus, can this include replacing the generic logger, everywhere it has cropped up, with the utils.logger?

Sorry, what do you mean exactly? Do you have an example?

pbontrager · 2024-08-28T17:23:28Z

Here is one example. The utils version makes sure the logger is set up correctly for distributed runs.

This was referenced Aug 27, 2024

Move collate to data #1422

Merged

Move argparse to config #1423

Merged

SalmanMohammadi mentioned this issue Aug 28, 2024

[RFC] Batched inference 🤝 KV-cache 🤝 compile #1424

Merged


RdoubleA mentioned this issue Aug 28, 2024

Move utils/constants to checkpointing #1427

Merged


RdoubleA closed this as completed Sep 3, 2024

This was referenced Sep 6, 2024

utils refactor clean-up #1515

Merged

no new cycles #1519

Merged
