Generalize configs and add Llama2 13B + Mistral 7B #571

Merged · 9 commits · Mar 24, 2024
Changes from 8 commits
31 changes: 17 additions & 14 deletions README.md
@@ -38,7 +38,10 @@ The library currently supports the following models and fine-tuning methods.

| Model | Sizes | Finetuning Methods |
|-----------------------------------------------|-----------|-----------------------------------------------------------|
| [Llama2](torchtune/models/llama2.py) | 7B | Full Finetuning [[single device](recipes/full_finetune_single_device.py), [distributed](recipes/full_finetune_distributed.py)], LoRA [[single device](recipes/lora_finetune_single_device.py), [distributed](recipes/lora_finetune_distributed.py)], QLoRA [single device](recipes/lora_finetune_single_device.py) |
| [Llama2](torchtune/models/llama2/_model_builders.py) | 7B | Full Finetuning [[single device](recipes/configs/llama2/7B_full_single_device.yaml), [distributed](recipes/configs/llama2/7B_full.yaml)] LoRA [[single device](recipes/configs/llama2/7B_lora_single_device.yaml), [distributed](recipes/configs/llama2/7B_lora.yaml)] QLoRA [single device](recipes/configs/llama2/7B_qlora_single_device.yaml) |
| [Llama2](torchtune/models/llama2/_model_builders.py) | 13B | [Full Finetuning](recipes/configs/llama2/13B_full.yaml), [LoRA](recipes/configs/llama2/13B_lora.yaml) |
| [Mistral](torchtune/models/mistral/_model_builders.py) | 7B | Full Finetuning and LoRA are WIP and will be added soon |


 

@@ -49,11 +52,11 @@ experience different peak memory utilization based on changes made in configurat

| Example HW Resources | Finetuning Method | Config | Model Size | Peak Memory per GPU
|--------------|-------------------|---------|------------|---------------------|
| 1 x RTX 4090 | QLoRA | [qlora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/qlora_finetune_single_device.yaml) | 7B | 9.29 GB * |
| 2 x RTX 4090 | LoRA | [lora_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/lora_finetune_distributed.yaml) | 7B | 14.17 GB * |
| 1 x RTX 4090 | LoRA | [lora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/lora_finetune_single_device.yaml) | 7B | 17.18 GB * |
| 1 x A6000 | Full finetune | [full_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_single_device.yaml) | 7B | 27.15 GB * |
| 4 x RTX 4090 | Full finetune | [full_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_distributed.yaml) | 7B | 12.01 GB * |
| 1 x RTX 4090 | QLoRA | [qlora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_qlora_single_device.yaml) | 7B | 9.29 GB * |
| 2 x RTX 4090 | LoRA | [lora_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_lora.yaml) | 7B | 14.17 GB * |
| 1 x RTX 4090 | LoRA | [lora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_lora_single_device.yaml) | 7B | 17.18 GB * |
| 1 x A6000 | Full finetune | [full_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_full_single_device.yaml) | 7B | 27.15 GB * |
| 4 x RTX 4090 | Full finetune | [full_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_full.yaml) | 7B | 12.01 GB * |


NOTE: * indicates an estimated metric based on experiments conducted on A100 GPUs with GPU memory artificially limited using [torch.cuda.set_per_process_memory_fraction API](https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html). Peak memory per GPU is as reported by `torch.cuda.max_memory_reserved()`. Please file an issue if you are not able to reproduce these results when running TorchTune on certain hardware.
@@ -117,32 +120,32 @@ Note: While the ``tune download`` command allows you to download *any* model fro
TorchTune contains recipes for:
- Full finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/full_finetune_single_device.py) and on [multiple devices with FSDP](https:/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py)
- LoRA finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_single_device.py) and on [multiple devices with FSDP](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_distributed.py).
- QLoRA finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_single_device.py), with a QLoRA specific [configuration](https:/pytorch/torchtune/blob/main/recipes/configs/qlora_finetune_single_device.yaml)
- QLoRA finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_single_device.py), with a QLoRA specific [configuration](https:/pytorch/torchtune/blob/main/recipes/configs/7B_qlora_single_device.yaml)

To run a full finetune on two devices on the Alpaca dataset using FSDP:
To run a full finetune on two devices on the Alpaca dataset using the Llama2 7B model and FSDP:

```
tune --nnodes 1 --nproc_per_node 2 \
full_finetune_distributed \
--config full_finetune_distributed
--config llama2/7B_full
```

The argument passed to `--nproc_per_node` can be varied depending on how many GPUs you have. A full finetune can be memory-intensive, so make sure you are running on enough devices. See [this table](https:/pytorch/torchtune/blob/main/README.md#finetuning-resource-requirements) for resource requirements on common hardware setups.
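For instance, on a machine with four GPUs the same full finetune would presumably be launched as follows (an illustrative sketch mirroring the command above, assuming four available devices):

```
tune --nnodes 1 --nproc_per_node 4 \
full_finetune_distributed \
--config llama2/7B_full
```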

Similarly, you can finetune with LoRA on the Alpaca dataset on two devices via the following.
Similarly, you can finetune with LoRA on the Alpaca dataset using the Llama2 13B model on two devices via the following.

```
tune --nnodes 1 --nproc_per_node 2 \
lora_finetune_distributed \
--config lora_finetune_distributed
--config llama2/13B_lora
```

Again, the argument to `--nproc_per_node` can be varied subject to memory constraints of your device(s).
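If you are memory-constrained, you can presumably also lower the batch size from the command line using the same key=value override syntax shown elsewhere in this PR (an illustrative sketch, not a command from the PR):

```
tune --nnodes 1 --nproc_per_node 2 \
lora_finetune_distributed \
--config llama2/13B_lora \
batch_size=8
```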

An example to run QLoRA on a single device can be achieved with the following:

```
tune lora_finetune_single_device --config recipes/configs/qlora_finetune_single_device.yaml
tune lora_finetune_single_device --config recipes/configs/llama2/7B_qlora_single_device
```

 
@@ -152,8 +155,8 @@ tune lora_finetune_single_device --config recipes/configs/qlora_finetune_single_
To copy a recipe to customize it yourself and then run
```
tune cp full_finetune_distributed.py my_recipe/full_finetune_distributed.py
tune cp full_finetune_distributed.yaml my_recipe/full_finetune_distributed.yaml
tune my_recipe/full_finetune_distributed.py --config my_recipe/full_finetune_distributed.yaml
tune cp llama2/7B_full.yaml my_recipe/7B_full.yaml
tune my_recipe/full_finetune_distributed.py --config my_recipe/7B_full.yaml
```

 
2 changes: 1 addition & 1 deletion docs/source/examples/configs.rst
@@ -161,7 +161,7 @@ will list out all the locations where an error was found.

.. code-block:: bash

tune validate --config recipes/configs/full_finetune_single_device.yaml batch_size=4
tune validate --config recipes/configs/llama2/7B_full.yaml batch_size=4
Contributor:

One thing to watch out for: there can be lots of issues when filenames start with a number. I've definitely seen it be a problem with Python imports; maybe it will be OK with YAML files? But it's something to keep in mind.

Contributor (Author):

This is interesting - what sort of issues? But yeah, I don't expect us to be importing the configs anymore.
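(For context on the point above: Python module names cannot begin with a digit, so a config named like `7B_full` could never be imported as a module, whereas YAML configs are only ever loaded by file path. A minimal illustration, not part of the PR discussion:)

```
# Importing a module whose name starts with a digit raises a SyntaxError:
python -c "import 7B_full"

# YAML configs are loaded by path instead, so the numeric prefix is fine:
tune validate --config recipes/configs/llama2/7B_full.yaml
```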


Best practices for writing configs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/source/examples/first_finetune_tutorial.rst
@@ -78,7 +78,7 @@ It looks like there's already a config called :code:`alpaca_llama_full_finetune`

.. code-block:: bash

tune cp full_finetune_distributed.yaml custom_config.yaml
tune cp llama2/7B_full.yaml custom_config.yaml

Now you can update the custom YAML config to point to your model and tokenizer. While you're at it,
you can make some other changes, like setting the random seed in order to make replication easier,
2 changes: 1 addition & 1 deletion docs/source/examples/lora_finetune.rst
@@ -258,7 +258,7 @@ You can then run the following command to perform a LoRA finetune of Llama2-7B u
.. note::
Make sure to point to the location of your Llama2 weights and tokenizer. This can be done
either by adding :code:`checkpointer.checkpoint_files=[my_model_checkpoint_path] tokenizer_checkpoint=my_tokenizer_checkpoint_path`
or by directly modifying the :code:`lora_finetune_distributed.yaml` file. See our :ref:`config_tutorial_label`
or by directly modifying the :code:`7B_lora.yaml` file. See our :ref:`config_tutorial_label`
for more details on how you can easily clone and modify TorchTune configs.
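For example, a hypothetical launch command with these overrides might look like the following (paths are placeholders, and the two-device distributed LoRA recipe from the README is assumed):

.. code-block:: bash

   tune --nnodes 1 --nproc_per_node 2 \
   lora_finetune_distributed \
   --config llama2/7B_lora \
   checkpointer.checkpoint_files=[my_model_checkpoint_path] \
   tokenizer_checkpoint=my_tokenizer_checkpoint_path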

.. note::
2 changes: 1 addition & 1 deletion docs/source/examples/recipe_deepdive.rst
@@ -43,7 +43,7 @@ Each recipe consists of three components:

In the following sections, we'll take a closer look at each of these components. For a complete working example, refer to the
`full finetuning recipe <https:/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py>`_ in TorchTune and the associated
`config <https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_distributed.yaml>`_.
`config <https:/pytorch/torchtune/blob/main/recipes/configs/7B_full.yaml>`_.


What Recipes are not?
2 changes: 1 addition & 1 deletion recipes/README.md
@@ -6,7 +6,7 @@

Recipes are the primary entry points for TorchTune users. These can be thought of as end-to-end pipelines for training and optionally evaluating LLMs. Each recipe consists of three components:

- **Configurable parameters**, specified through yaml configs [example](https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_distributed.yaml) and command-line overrides
- **Configurable parameters**, specified through yaml configs [example](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_full.yaml) and command-line overrides
- **Recipe class**, core logic needed for training, exposed to users through a set of APIs [interface](https:/pytorch/torchtune/blob/main/recipes/interfaces.py)
- **Recipe script**, puts everything together including parsing and validating configs, setting up the environment, and correctly using the recipe class

85 changes: 85 additions & 0 deletions recipes/configs/llama2/13B_full.yaml
@@ -0,0 +1,85 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Llama2 13B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id meta-llama/Llama-2-13b-hf \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2-13b-hf
#
# To launch on 4 devices, run the following command from root:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/13B_full \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/13B_full \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config should be used with 2+ GPUs. Single device full fine-tuning
# requires several memory optimizations which are exposed through
# 7B_full_single_device.yaml. Please update the model and checkpoints to 13B
# in that config.
Comment on lines +21 to +24
Contributor:

I guess implicit in our choice of naming here is that >1 device is kind of now the "default", right? While I understand that we are doing more memory optimizations in the single device recipes now, we've obviously seen that FSDP comes with its own nuances too. So I do wonder if it's now hard for someone to just come in and say "give me a simple single-device recipe to get started on"

This is also a bit weird for QLoRA imo where we currently only support single device

Contributor (Author):

Yeah, this is a good question. I did this primarily for two reasons:

- With the distributed CI testing sorted out, I removed the constraint on the distributed recipes. We can now run those on a single device, just without the memory optimizations.
- As we go to larger models, the single-device setting will become less frequent. So when I thought about the default, distributed seemed like the more natural one.

Does this make sense?
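(As an aside, running one of the distributed recipes on a single device as described above would presumably just mean setting `--nproc_per_node` to 1. An illustrative sketch using the 7B config, not a command from the PR:)

```
tune --nnodes 1 --nproc_per_node 1 \
full_finetune_distributed \
--config llama2/7B_full
```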



# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/llama2/tokenizer.model

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama2.llama2_13b

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/llama2-13b-hf/
  checkpoint_files: [
    pytorch_model-00001-of-00003.bin,
    pytorch_model-00002-of-00003.bin,
    pytorch_model-00003-of-00003.bin
  ]
  recipe_checkpoint: null
  output_dir: /tmp/llama2-13b-hf/
  model_type: LLAMA2
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
loss:
  _component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1


# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /tmp/alpaca-llama2-finetune
log_every_n_steps: null
90 changes: 90 additions & 0 deletions recipes/configs/llama2/13B_lora.yaml
@@ -0,0 +1,90 @@
# Config for multi-device LoRA in lora_finetune_distributed.py
# using a Llama2 13B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id meta-llama/Llama-2-13b-hf \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2-13b-hf
#
# To launch on 4 devices, run the following command from root:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
# --config llama2/13B_lora \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
# --config llama2/13B_lora \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device lora finetuning please use 7B_lora_single_device.yaml
# or 7B_qlora_single_device.yaml and update the model and checkpoints to
# the 13B model.


# Model Arguments
model:
  _component_: torchtune.models.llama2.lora_llama2_13b
  lora_attn_modules: ['q_proj', 'v_proj', 'k_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: True
  lora_rank: 8
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/llama2-13b-hf/
  checkpoint_files: [
    pytorch_model-00001-of-00003.bin,
    pytorch_model-00002-of-00003.bin,
    pytorch_model-00003-of-00003.bin
  ]
  adapter_checkpoint: null
  recipe_checkpoint: null
  output_dir: /tmp/llama2-13b-hf/
  model_type: LLAMA2
resume_from_checkpoint: False

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/llama2/tokenizer.model

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
  use_clean: True
seed: null
shuffle: True
batch_size: 32

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 2e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null

# Logging
output_dir: /tmp/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: null

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: False
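As with the other configs in this PR, the hyperparameters above can presumably be overridden at launch time instead of editing the file, using the same key=value syntax shown in the config header (an illustrative sketch):

```
tune --nnodes 1 --nproc_per_node 4 \
lora_finetune_distributed \
--config llama2/13B_lora \
model.lora_rank=16 model.lora_alpha=32 batch_size=16
```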
80 changes: 80 additions & 0 deletions recipes/configs/llama2/7B_full.yaml
@@ -0,0 +1,80 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Llama2 7B model
#
# This config assumes that you've run the following command before launching
Contributor:

Why add this huge block in the config?

Contributor (Author):

Currently there's no documentation on the configs at all. Once we have live docs available, we can add these to the docs. But for now, I'd like to give users some understanding of when and how to use each config.

Contributor:

I think there's still some information on configs in the README, and we can make that clearer. I think cluttering up the configs can be overwhelming.

Contributor (Author):

Why would it be overwhelming? Isn't it just documentation? I don't think we would be able to add config-level info to the README?

# this run:
# tune download --repo-id meta-llama/Llama-2-7b \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2
#
# To launch on 4 devices, run the following command from root:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/7B_full \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/7B_full \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# Single device full finetuning requires more memory optimizations. It's
# best to use 7B_full_single_device.yaml for those cases


# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/llama2/tokenizer.model

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama2.llama2_7b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/llama2
  checkpoint_files: [consolidated.00.pth]
  recipe_checkpoint: null
  output_dir: /tmp/llama2
  model_type: LLAMA2
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
loss:
  _component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1


# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /tmp/alpaca-llama2-finetune
log_every_n_steps: null