
Missing lm_head.weight key when using Gemma 7B distributed LoRA recipe with gemma-7b-it #1122

Closed · aubreyjstrier opened this issue Jun 26, 2024 · 7 comments · Fixed by #1169
Labels: bug (Something isn't working), question (Further information is requested)

@aubreyjstrier

Hi,

I'm having an issue with using the distributed LoRA recipe with instruction-tuned Gemma 7B via the torchtune CLI. The unexpected behavior is very similar to the bug raised in issue #1062 and solved in PR #1064, but instead of the lm_head.weight key being missing from the state_dict, it's missing from the weight_map field of the checkpointer.

After training is complete, torchtune attempts to split the state_dict beginning at line 506 and indexes into self._weight_map at line 508, at which point it errors:

split_state_dicts: Dict[str, Dict[str, torch.Tensor]] = {}
for key, weight in state_dict[utils.MODEL_KEY].items():
    cpt_idx = self._weight_map[key]  ### fails here
    if cpt_idx not in split_state_dicts:
        split_state_dicts[cpt_idx] = {}
    split_state_dicts[cpt_idx].update({key: weight})

It raises: [rank0]: KeyError: 'lm_head.weight'

I'm using the default config from the recipes folder, except that the checkpointer reads from and outputs to a directory where I've downloaded gemma-7b-it straight from the HF hub. This is on the latest build of torchtune.

Thanks in advance for your help!

@SalmanMohammadi (Collaborator)

Hey @aubreyjstrier. Thanks for raising this.

I think you're right about it being similar to the earlier issue. Re-reading my earlier fix and some of the HF code, I don't actually think we needed those changes. Hugging Face models should generally be loaded and saved using from_pretrained and save_pretrained, which take care of tied embedding weights; the issue at hand was loading weights directly into the model. The Gemma 2B checkpoint doesn't have a key for an output projection.

At first glance: I think the reason it's failing here is that we build a map of key : {checkpoint_file_id} for the model state dict, so we know to save tensor_0 in checkpoint 0001 and so on. When saving the checkpoint, we don't have an entry for lm_head since we didn't load one from the HF checkpoint in the first place.
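
Roughly, the bookkeeping looks like this (a sketch, not the exact checkpointer code; checkpoint_files here is a placeholder for the downloaded HF shards):

import torch

weight_map = {}
for file_idx, checkpoint_file in enumerate(checkpoint_files, start=1):
    shard = torch.load(checkpoint_file, map_location="cpu", weights_only=True)
    for key in shard:
        # Remember which source file each key came from, e.g. "0001".
        weight_map[key] = f"{file_idx:04d}"

# At save time every key in the converted state dict is looked up in
# weight_map. "lm_head.weight" never appears in the Gemma HF shards
# (tied embeddings), so weight_map["lm_head.weight"] raises KeyError.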

I'll have a closer look later but I would be tempted to revert the changes in my PR with some additional testing. cc @pbontrager @ebsmothers

@pbontrager (Contributor)

@SalmanMohammadi can you share which PR you're referring to? I assume we can update the checkpoint mapping but I want to see the change.

@SalmanMohammadi (Collaborator)

> @SalmanMohammadi can you share which PR you're referring to? I assume we can update the checkpoint mapping but I want to see the change.

#1064

I think the issue that PR was solving was due to the way the user was loading weights into a HF model (i.e. not using from_pretrained), rather than anything to do with our checkpointing.

@felipemello1 added the bug (Something isn't working) and question (Further information is requested) labels on Jul 2, 2024
@joecummings (Contributor)

I'm running into this issue as well. Will update this issue as I figure things out.

@joecummings (Contributor)

Reverting #1064 gets the following error trace

Traceback (most recent call last):
  File "/home/jrcummings/.conda/envs/joe-torchtune/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/jrcummings/projects/joe-torchtune/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/jrcummings/projects/joe-torchtune/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/jrcummings/projects/joe-torchtune/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/jrcummings/projects/joe-torchtune/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/home/jrcummings/projects/joe-torchtune/recipes/lora_finetune_single_device.py", line 648, in <module>
    sys.exit(recipe_main())
             ^^^^^^^^^^^^^
  File "/home/jrcummings/projects/joe-torchtune/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/home/jrcummings/projects/joe-torchtune/recipes/lora_finetune_single_device.py", line 643, in recipe_main
    recipe.train()
  File "/home/jrcummings/projects/joe-torchtune/recipes/lora_finetune_single_device.py", line 625, in train
    self.save_checkpoint(epoch=curr_epoch)
  File "/home/jrcummings/projects/joe-torchtune/recipes/lora_finetune_single_device.py", line 517, in save_checkpoint
    self._checkpointer.save_checkpoint(
  File "/home/jrcummings/projects/joe-torchtune/torchtune/utils/_checkpointing/_checkpointer.py", line 535, in save_checkpoint
    ] = convert_weights.tune_to_peft_adapter_weights(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jrcummings/projects/joe-torchtune/torchtune/models/convert_weights.py", line 282, in tune_to_peft_adapter_weights
    value = _permute_lora_matrix(value, num_heads)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jrcummings/projects/joe-torchtune/torchtune/models/convert_weights.py", line 274, in _permute_lora_matrix
    t.view(n_heads, head_dim // 2, 2, rank)
RuntimeError: shape '[16, 96, 2, 8]' is invalid for input of size 32768

@SalmanMohammadi (Collaborator)

> Reverting #1064 gets the following error trace

This also looks like a different error from the one the original poster hit, which is strange. FYI, the small test I wrote for Gemma checkpointing passes if I use the original checkpointing logic: https://gist.github.com/SalmanMohammadi/e59ff24add75d37a2b81eeaccbff057c.

@ebsmothers (Contributor)

Just getting caught up on some of this. I think we should treat this as two separate pieces.

(1) We should revert #1064. I don't think the usage of load_state_dict in #1062 is the correct way to load from a torchtune model with tied weights into an HF model with tied weights. Imo we should adhere to the format actually provided on the hub, which does not contain lm_head.weight, so I think we were doing the correct thing before. The reason HF's from_pretrained works with this checkpoint is their usage of _tied_weights_keys (as pointed out by @SalmanMohammadi in the original PR). But I don't think we should try to replicate this on our end; better to adhere to the contract that we return the same state dict format at the end of training as the one we got in (i.e. one that doesn't contain lm_head.weight). If we want to load a torchtune checkpoint directly into the CausalLM version of Gemma (which has patched in an extra weight on the backend), it's natural to ask the user to write the corresponding glue code themselves (see the first sketch below).

(2) The new error seems to be coming from our PEFT integration. I already chatted with @joecummings about this a bit, but because Gemma 7B does not satisfy num_heads * head_dim = embed_dim, we may need to change this function for saving our LoRA weights into PEFT format. I imagine we will want to just explicitly pass head_dim instead of inferring it like we currently do (see the second sketch below). Note that we only need to permute the LoRA B matrix (exercise to the reader to figure out why), so we don't need to worry about permuting the A matrix, which now will not have any knowledge of num_heads or head_dim.
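
For (1), a minimal sketch of the kind of glue code such a user could write, assuming a single consolidated HF-format output file (the file name below is hypothetical): because the saved checkpoint ties lm_head to the input embeddings and therefore has no lm_head.weight key, the embedding weight is mirrored into lm_head before loading into the CausalLM model.

import torch
from transformers import AutoModelForCausalLM

# Hypothetical output file name; adjust to the checkpointer's actual output.
state_dict = torch.load("hf_model_0001_0.pt", map_location="cpu", weights_only=True)

# The hub format (and torchtune's output) has no lm_head.weight because the
# output projection is tied to the input embeddings, so mirror it explicitly.
state_dict["lm_head.weight"] = state_dict["model.embed_tokens.weight"]

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
model.load_state_dict(state_dict)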
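
For (2), a sketch of the direction described above (not the final fix), passing head_dim explicitly instead of inferring it as embed_dim // num_heads. With the inferred value of 3072 // 16 = 192, the view in the traceback is [16, 96, 2, 8], i.e. 24576 elements, against a LoRA B matrix of 4096 * 8 = 32768 elements, hence the shape error; with Gemma 7B's actual head_dim of 256 the shapes line up. The body below mirrors the view call from the traceback plus the usual rope permutation pattern, so treat it as an illustration rather than torchtune's exact code.

import torch

def _permute_lora_matrix(t: torch.Tensor, n_heads: int, head_dim: int) -> torch.Tensor:
    # t is the LoRA B matrix for q_proj/k_proj, shape (n_heads * head_dim, rank).
    rank = t.shape[-1]
    return (
        t.view(n_heads, head_dim // 2, 2, rank)
        .transpose(1, 2)
        .reshape(n_heads * head_dim, rank)
    )

# Gemma 7B attention dims: 16 query heads, head_dim 256, embed_dim 3072.
lora_b = torch.randn(16 * 256, 8)
assert _permute_lora_matrix(lora_b, n_heads=16, head_dim=256).shape == lora_b.shape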
