Generalize configs and add Llama2 13B + Mistral 7B #571

Merged · 9 commits · Mar 24, 2024
Changes from 8 commits
31 changes: 17 additions & 14 deletions README.md
@@ -38,7 +38,10 @@ The library currently supports the following models and fine-tuning methods.

| Model | Sizes | Finetuning Methods |
|-----------------------------------------------|-----------|-----------------------------------------------------------|
| [Llama2](torchtune/models/llama2.py) | 7B | Full Finetuning [[single device](recipes/full_finetune_single_device.py), [distributed](recipes/full_finetune_distributed.py)], LoRA [[single device](recipes/lora_finetune_single_device.py), [distributed](recipes/lora_finetune_distributed.py)], QLoRA [single device](recipes/lora_finetune_single_device.py) |
| [Llama2](torchtune/models/llama2/_model_builders.py) | 7B | Full Finetuning [[single device](recipes/configs/llama2/7B_full_single_device.yaml), [distributed](recipes/configs/llama2/7B_full.yaml)] LoRA [[single device](recipes/configs/llama2/7B_lora_single_device.yaml), [distributed](recipes/configs/llama2/7B_lora.yaml)] QLoRA [single device](recipes/configs/llama2/7B_qlora_single_device.yaml) |
| [Llama2](torchtune/models/llama2/_model_builders.py) | 13B | [Full Finetuning](recipes/configs/llama2/13B_full.yaml), [LoRA](recipes/configs/llama2/13B_lora.yaml) |
| [Mistral](torchtune/models/mistral/_model_builders.py) | 7B | Full Finetuning and LoRA are WIP and will be added soon |


 

@@ -49,11 +52,11 @@ experience different peak memory utilization based on changes made in configurat

| Example HW Resources | Finetuning Method | Config | Model Size | Peak Memory per GPU
|--------------|-------------------|---------|------------|---------------------|
| 1 x RTX 4090 | QLoRA | [qlora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/qlora_finetune_single_device.yaml) | 7B | 9.29 GB * |
| 2 x RTX 4090 | LoRA | [lora_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/lora_finetune_distributed.yaml) | 7B | 14.17 GB * |
| 1 x RTX 4090 | LoRA | [lora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/lora_finetune_single_device.yaml) | 7B | 17.18 GB * |
| 1 x A6000 | Full finetune | [full_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_single_device.yaml) | 7B | 27.15 GB * |
| 4 x RTX 4090 | Full finetune | [full_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_distributed.yaml) | 7B | 12.01 GB * |
| 1 x RTX 4090 | QLoRA | [qlora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_qlora_single_device.yaml) | 7B | 9.29 GB * |
| 2 x RTX 4090 | LoRA | [lora_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_lora.yaml) | 7B | 14.17 GB * |
| 1 x RTX 4090 | LoRA | [lora_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_lora_single_device.yaml) | 7B | 17.18 GB * |
| 1 x A6000 | Full finetune | [full_finetune_single_device](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_full_single_device.yaml) | 7B | 27.15 GB * |
| 4 x RTX 4090 | Full finetune | [full_finetune_distributed](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_full.yaml) | 7B | 12.01 GB * |


NOTE: * indicates an estimated metric based on experiments conducted on A100 GPUs with GPU memory artificially limited using [torch.cuda.set_per_process_memory_fraction API](https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html). Peak memory per GPU is as reported by `torch.cuda.max_memory_reserved()`. Please file an issue if you are not able to reproduce these results when running TorchTune on certain hardware.
@@ -117,32 +120,32 @@ Note: While the ``tune download`` command allows you to download *any* model fro
TorchTune contains recipes for:
- Full finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/full_finetune_single_device.py) and on [multiple devices with FSDP](https:/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py)
- LoRA finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_single_device.py) and on [multiple devices with FSDP](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_distributed.py).
- QLoRA finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_single_device.py), with a QLoRA specific [configuration](https:/pytorch/torchtune/blob/main/recipes/configs/qlora_finetune_single_device.yaml)
- QLoRA finetuning on [single device](https:/pytorch/torchtune/blob/main/recipes/lora_finetune_single_device.py), with a QLoRA specific [configuration](https:/pytorch/torchtune/blob/main/recipes/configs/7B_qlora_single_device.yaml)

To run a full finetune on two devices on the Alpaca dataset using FSDP:
To run a full finetune on two devices on the Alpaca dataset using the Llama2 7B model and FSDP:

```
tune --nnodes 1 --nproc_per_node 2 \
full_finetune_distributed \
--config full_finetune_distributed
--config llama2/7B_full
```

The argument passed to `--nproc_per_node` can be varied depending on how many GPUs you have. A full finetune can be memory-intensive, so make sure you are running on enough devices. See [this table](https:/pytorch/torchtune/blob/main/README.md#finetuning-resource-requirements) for resource requirements on common hardware setups.
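For instance, on a machine with four GPUs the same full finetune would presumably be launched as follows (an illustrative sketch mirroring the command above, assuming four available devices):

```
tune --nnodes 1 --nproc_per_node 4 \
full_finetune_distributed \
--config llama2/7B_full
```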

Similarly, you can finetune with LoRA on the Alpaca dataset on two devices via the following.
Similarly, you can finetune with LoRA on the Alpaca dataset using the Llama2 13B model on two devices via the following.

```
tune --nnodes 1 --nproc_per_node 2 \
lora_finetune_distributed \
--config lora_finetune_distributed
--config llama2/13B_lora
```

Again, the argument to `--nproc_per_node` can be varied subject to memory constraints of your device(s).
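If you are memory-constrained, you can presumably also lower the batch size from the command line using the same key=value override syntax shown elsewhere in this PR (an illustrative sketch, not a command from the PR):

```
tune --nnodes 1 --nproc_per_node 2 \
lora_finetune_distributed \
--config llama2/13B_lora \
batch_size=8
```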

An example to run QLoRA on a single device can be achieved with the following:

```
tune lora_finetune_single_device --config recipes/configs/qlora_finetune_single_device.yaml
tune lora_finetune_single_device --config recipes/configs/llama2/7B_qlora_single_device
```

 
@@ -152,8 +155,8 @@ tune lora_finetune_single_device --config recipes/configs/qlora_finetune_single_
To copy a recipe to customize it yourself and then run
```
tune cp full_finetune_distributed.py my_recipe/full_finetune_distributed.py
tune cp full_finetune_distributed.yaml my_recipe/full_finetune_distributed.yaml
tune my_recipe/full_finetune_distributed.py --config my_recipe/full_finetune_distributed.yaml
tune cp llama2/7B_full.yaml my_recipe/7B_full.yaml
tune my_recipe/full_finetune_distributed.py --config my_recipe/7B_full.yaml
```

 
2 changes: 1 addition & 1 deletion docs/source/examples/configs.rst
@@ -161,7 +161,7 @@ will list out all the locations where an error was found.

.. code-block:: bash

tune validate --config recipes/configs/full_finetune_single_device.yaml batch_size=4
tune validate --config recipes/configs/llama2/7B_full.yaml batch_size=4
Contributor:

One thing to watch out for: there can be lots of issues when filenames start with a number. I've definitely seen it be a problem with Python imports; maybe it will be OK with YAML files? But it's something to keep in mind.

Contributor (Author):

This is interesting - what sort of issues? But yeah, I don't expect us to be importing the configs anymore.
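(For context on the point above: Python module names cannot begin with a digit, so a config named like `7B_full` could never be imported as a module, whereas YAML configs are only ever loaded by file path. A minimal illustration, not part of the PR discussion:)

```
# Importing a module whose name starts with a digit raises a SyntaxError:
python -c "import 7B_full"

# YAML configs are loaded by path instead, so the numeric prefix is fine:
tune validate --config recipes/configs/llama2/7B_full.yaml
```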


Best practices for writing configs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/source/examples/first_finetune_tutorial.rst
@@ -78,7 +78,7 @@ It looks like there's already a config called :code:`alpaca_llama_full_finetune`

.. code-block:: bash

tune cp full_finetune_distributed.yaml custom_config.yaml
tune cp llama2/7B_full.yaml custom_config.yaml

Now you can update the custom YAML config to point to your model and tokenizer. While you're at it,
you can make some other changes, like setting the random seed in order to make replication easier,
2 changes: 1 addition & 1 deletion docs/source/examples/lora_finetune.rst
@@ -258,7 +258,7 @@ You can then run the following command to perform a LoRA finetune of Llama2-7B u
.. note::
Make sure to point to the location of your Llama2 weights and tokenizer. This can be done
either by adding :code:`checkpointer.checkpoint_files=[my_model_checkpoint_path] tokenizer_checkpoint=my_tokenizer_checkpoint_path`
or by directly modifying the :code:`lora_finetune_distributed.yaml` file. See our :ref:`config_tutorial_label`
or by directly modifying the :code:`7B_lora.yaml` file. See our :ref:`config_tutorial_label`
for more details on how you can easily clone and modify TorchTune configs.
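For example, a hypothetical launch command with these overrides might look like the following (paths are placeholders, and the two-device distributed LoRA recipe from the README is assumed):

.. code-block:: bash

   tune --nnodes 1 --nproc_per_node 2 \
   lora_finetune_distributed \
   --config llama2/7B_lora \
   checkpointer.checkpoint_files=[my_model_checkpoint_path] \
   tokenizer_checkpoint=my_tokenizer_checkpoint_path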

.. note::
2 changes: 1 addition & 1 deletion docs/source/examples/recipe_deepdive.rst
@@ -43,7 +43,7 @@ Each recipe consists of three components:

In the following sections, we'll take a closer look at each of these components. For a complete working example, refer to the
`full finetuning recipe <https:/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py>`_ in TorchTune and the associated
`config <https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_distributed.yaml>`_.
`config <https:/pytorch/torchtune/blob/main/recipes/configs/7B_full.yaml>`_.


What Recipes are not?
2 changes: 1 addition & 1 deletion recipes/README.md
@@ -6,7 +6,7 @@

Recipes are the primary entry points for TorchTune users. These can be thought of as end-to-end pipelines for training and optionally evaluating LLMs. Each recipe consists of three components:

- **Configurable parameters**, specified through yaml configs [example](https:/pytorch/torchtune/blob/main/recipes/configs/full_finetune_distributed.yaml) and command-line overrides
- **Configurable parameters**, specified through yaml configs [example](https:/pytorch/torchtune/blob/main/recipes/configs/llama2/7B_full.yaml) and command-line overrides
- **Recipe class**, core logic needed for training, exposed to users through a set of APIs [interface](https:/pytorch/torchtune/blob/main/recipes/interfaces.py)
- **Recipe script**, puts everything together including parsing and validating configs, setting up the environment, and correctly using the recipe class

85 changes: 85 additions & 0 deletions recipes/configs/llama2/13B_full.yaml
@@ -0,0 +1,85 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Llama2 13B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id meta-llama/Llama-2-13b-hf \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2-13b-hf
#
# To launch on 4 devices, run the following command from root:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/13B_full \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/13B_full \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config should be used with 2+ GPUs. Single device full fine-tuning
# requires several memory optimizations which are exposed through
# 7B_full_single_device.yaml. Please update the model and checkpoints to 13B
# in that config.
Comment on lines +21 to +24
Contributor:

I guess implicit in our choice of naming here is that >1 device is kind of now the "default", right? While I understand that we are doing more memory optimizations in the single device recipes now, we've obviously seen that FSDP comes with its own nuances too. So I do wonder if it's now hard for someone to just come in and say "give me a simple single-device recipe to get started on"

This is also a bit weird for QLoRA imo where we currently only support single device

Contributor (Author):

Yeah, this is a good question. I did this primarily for two reasons:

- With the distributed CI testing sorted out, I removed the constraint on the distributed recipes. We can now run those on a single device, just without the memory optimizations.
- As we go to larger models, the single-device setting will become less frequent. So when I thought about the default, distributed seemed like the more natural one.

Does this make sense?
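(As an aside, running one of the distributed recipes on a single device as described above would presumably just mean setting `--nproc_per_node` to 1. An illustrative sketch using the 7B config, not a command from the PR:)

```
tune --nnodes 1 --nproc_per_node 1 \
full_finetune_distributed \
--config llama2/7B_full
```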



# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/llama2/tokenizer.model

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama2.llama2_13b

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/llama2-13b-hf/
  checkpoint_files: [
    pytorch_model-00001-of-00003.bin,
    pytorch_model-00002-of-00003.bin,
    pytorch_model-00003-of-00003.bin
  ]
  recipe_checkpoint: null
  output_dir: /tmp/llama2-13b-hf/
  model_type: LLAMA2
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
loss:
  _component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1


# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /tmp/alpaca-llama2-finetune
log_every_n_steps: null
90 changes: 90 additions & 0 deletions recipes/configs/llama2/13B_lora.yaml
@@ -0,0 +1,90 @@
# Config for multi-device LoRA in lora_finetune_distributed.py
# using a Llama2 13B model
#
# This config assumes that you've run the following command before launching
# this run:
# tune download --repo-id meta-llama/Llama-2-13b-hf \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2-13b-hf
#
# To launch on 4 devices, run the following command from root:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
# --config llama2/13B_lora \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 lora_finetune_distributed \
# --config llama2/13B_lora \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# For single device lora finetuning please use 7B_lora_single_device.yaml
# or 7B_qlora_single_device.yaml and update the model and checkpoints to
# the 13B model.


# Model Arguments
model:
  _component_: torchtune.models.llama2.lora_llama2_13b
  lora_attn_modules: ['q_proj', 'v_proj', 'k_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: True
  lora_rank: 8
  lora_alpha: 16

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /tmp/llama2-13b-hf/
  checkpoint_files: [
    pytorch_model-00001-of-00003.bin,
    pytorch_model-00002-of-00003.bin,
    pytorch_model-00003-of-00003.bin
  ]
  adapter_checkpoint: null
  recipe_checkpoint: null
  output_dir: /tmp/llama2-13b-hf/
  model_type: LLAMA2
resume_from_checkpoint: False

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/llama2/tokenizer.model

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
  use_clean: True
seed: null
shuffle: True
batch_size: 32

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 2e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null

# Logging
output_dir: /tmp/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: null

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: False
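As with the other configs in this PR, the hyperparameters above can presumably be overridden at launch time instead of editing the file, using the same key=value syntax shown in the config header (an illustrative sketch):

```
tune --nnodes 1 --nproc_per_node 4 \
lora_finetune_distributed \
--config llama2/13B_lora \
model.lora_rank=16 model.lora_alpha=32 batch_size=16
```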
80 changes: 80 additions & 0 deletions recipes/configs/llama2/7B_full.yaml
@@ -0,0 +1,80 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Llama2 7B model
#
# This config assumes that you've run the following command before launching
Contributor:

Why add this huge block in the config?

Contributor (Author):

Currently there's no documentation on the configs at all. Once we have live docs available, we can add these to the docs. But for now, I'd like to give users some understanding of when and how to use each config.

Contributor:

I think there's still some information on configs in the README, and we can make that clearer. I think cluttering up the configs can be overwhelming.

Contributor (Author):

Why would it be overwhelming? Isn't it just documentation? I don't think we would be able to add config-level info to the README?

# this run:
# tune download --repo-id meta-llama/Llama-2-7b \
# --hf-token <HF_TOKEN> \
# --output-dir /tmp/llama2
#
# To launch on 4 devices, run the following command from root:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/7B_full \
#
# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune --nnodes 1 --nproc_per_node 4 full_finetune_distributed \
# --config llama2/7B_full \
# checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
# Single device full finetuning requires more memory optimizations. It's
# best to use 7B_full_single_device.yaml for those cases


# Tokenizer
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/llama2/tokenizer.model

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
seed: null
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama2.llama2_7b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/llama2
  checkpoint_files: [consolidated.00.pth]
  recipe_checkpoint: null
  output_dir: /tmp/llama2
  model_type: LLAMA2
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 3
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
loss:
  _component_: torch.nn.CrossEntropyLoss
max_steps_per_epoch: null
gradient_accumulation_steps: 1


# Training env
device: cuda

# Distributed
cpu_offload: False

# Memory management
enable_activation_checkpointing: True

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
output_dir: /tmp/alpaca-llama2-finetune
log_every_n_steps: null