Fine-tuning pretrained model #51
Hello, when I ran `python train.py --name test_run --load /path/to/mqan_decanlp_better_sampling_cove_cpu/iteration_560000.pth --resume --device 0 --cove --train_tasks new_task` I received the following error message: `ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group`. I have double-checked the parameters in config.json in mqan_decanlp_better_sampling_cove_cpu. What could be the problem? Am I missing something? Thank you in advance!
These queries are quite old, but I've been hitting the same problems, so I'm posting some answers in case they are useful for others.

@ashleyyy94 It looks like you were running without the --cove parameter. The pretrained model you were trying to use was trained with CoVe, so you need to pass --cove to continue training from it.

@hot-cheeto I have been hitting this problem too. I don't fully understand it, but it seems possible to work around it by dropping the --resume parameter. The issue is that the stored optimizer state has a mismatching number of parameters: the checkpoint's optimizer state has 153 parameters, whereas if I start training with the same parameters as you, the optimizer state only has 137 (16 fewer). I have not yet worked out what accounts for these extra 16 optimizer parameters, so I have no idea how to correct for them. But I believe it is reasonable to discard the optimizer state and continue training from the model state alone, and I seem to have got reasonable results doing so. You can do that by dropping the --resume parameter. Once you have a checkpoint that you have generated yourself, you can then continue from that checkpoint using the --resume parameter.
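A minimal sketch of that workaround in PyTorch: restore the model weights from the checkpoint, but build a fresh optimizer rather than calling load_state_dict() on the stored (mismatched) optimizer state. The checkpoint keys used here ("model_state_dict", "optimizer_state_dict") and the tiny model are assumptions for illustration, not necessarily decaNLP's actual save format.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for a checkpoint saved by an earlier run.
checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}

# Restore the weights only; discard the stored optimizer state entirely
# (the equivalent of dropping --resume) and start the optimizer fresh.
resumed = nn.Linear(4, 2)
resumed.load_state_dict(checkpoint["model_state_dict"])
fresh_optimizer = torch.optim.Adam(resumed.parameters(), lr=1e-3)
```

The fresh optimizer loses the saved moment estimates, so the first updates after resuming may be slightly noisier, but the model weights themselves are fully restored.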
Adding a note that you can't set strict=False on the call to load_state_dict for the optimizer; the reason is explained in pytorch/pytorch#3852.

I am suspicious there has been some change to the model since the pre-trained checkpoint referenced in the README was generated: the pre-trained run's logs report a different number of trainable parameters from what I see when training (a difference of about 3.5M), which seems concerning too, and this doesn't seem to be down to configuration (unless I have missed something). There have been quite a few changes to the repo since 26 Oct 2018 (including a bunch on Oct 26 itself). I've not analyzed them all, but it seems plausible that one of these changes resulted in the incompatibility of the optimizer's stored state, causing this problem.
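For what it's worth, the ValueError above comes from a size check inside the optimizer's load_state_dict: each saved param group must contain the same number of parameters as the corresponding group in the live optimizer. A rough sketch of that check, with mocked state dicts (real ones have the same "param_groups" shape and come from torch.optim.Optimizer.state_dict()):

```python
# Mocked optimizer state dicts; the 153-vs-137 counts match the
# mismatch observed above.
saved_state = {"param_groups": [{"params": list(range(153))}]}
live_state = {"param_groups": [{"params": list(range(137))}]}

mismatches = []
for saved, live in zip(saved_state["param_groups"], live_state["param_groups"]):
    if len(saved["params"]) != len(live["params"]):
        mismatches.append((len(saved["params"]), len(live["params"])))

# A non-empty list here is what triggers the ValueError in PyTorch.
print(mismatches)
```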
I'm trying to fine-tune the provided pretrained model on my custom dataset. The command is `nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) bmccann/decanlp:cuda9_torch041 bash -c "python /decaNLP/train.py --load /decaNLP/mqan_decanlp_better_sampling_cove_cpu/iteration_560000.pth --resume --train_tasks mwo"`
While trying to initialise the MQAN model, it throws up this error:
RuntimeError: Error(s) in loading state_dict for MultitaskQuestionAnsweringNetwork: Missing key(s) in state_dict: "encoder_embeddings.projection.linear.weight", "encoder_embeddings.projection.linear.bias". Unexpected key(s) in state_dict: "cove.rnn1.weight_ih_l0", "cove.rnn1.weight_hh_l0", "cove.rnn1.bias_ih_l0", "cove.rnn1.bias_hh_l0", "cove.rnn1.weight_ih_l0_reverse", "cove.rnn1.weight_hh_l0_reverse", "cove.rnn1.bias_ih_l0_reverse", "cove.rnn1.bias_hh_l0_reverse", "cove.rnn1.weight_ih_l1", "cove.rnn1.weight_hh_l1", "cove.rnn1.bias_ih_l1", "cove.rnn1.bias_hh_l1", "cove.rnn1.weight_ih_l1_reverse", "cove.rnn1.weight_hh_l1_reverse", "cove.rnn1.bias_ih_l1_reverse", "cove.rnn1.bias_hh_l1_reverse", "project_cove.linear.weight", "project_cove.linear.bias".
Kindly advise how to go about fine-tuning the model. Thank you.
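One quick way to see what load_state_dict is complaining about is to diff the checkpoint's keys against the freshly built model's keys: "missing" keys are ones the model expects but the checkpoint lacks, and "unexpected" keys (here, the cove.* ones) indicate the checkpoint was trained with --cove while the model was built without it. A small sketch, with key sets mocked from the error message above (in practice they would come from the loaded state dict and model.state_dict().keys()):

```python
# Abbreviated key sets taken from the error message; real ones are larger.
ckpt_keys = {
    "cove.rnn1.weight_ih_l0",
    "project_cove.linear.weight",
    "project_cove.linear.bias",
}
model_keys = {
    "encoder_embeddings.projection.linear.weight",
    "encoder_embeddings.projection.linear.bias",
}

missing = model_keys - ckpt_keys      # model expects these; checkpoint lacks them
unexpected = ckpt_keys - model_keys   # checkpoint has these; model does not

print(sorted(missing))
print(sorted(unexpected))
```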