
Pretraining with option --n-save-every still saves all models #5280

Closed
chopeen opened this issue Apr 9, 2020 · 12 comments
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface feat / serialize Feature: Serialization, saving and loading feat / tok2vec Feature: Token-to-vector layer and pretraining help wanted Contributions welcome!

Comments

@chopeen (Contributor) commented Apr 9, 2020

I am running the following command:

!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT \
    --use-vectors --n-save-every 5

To use less disk space, I specified the option --n-save-every, expecting it to save a model only every X batches.

However, all models are still saved, along with additional .temp.bin files:

$ ls -al /kaggle/working/models/tok2vec_rf_sent_sci
[snip]
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model110.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model110.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model111.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model111.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model112.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model112.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model113.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model113.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model114.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model114.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model115.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model115.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model116.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model116.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model117.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model117.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model118.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model118.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model119.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model119.temp.bin
[snip]

Info about spaCy

  • spaCy version: 2.2.4
  • Platform: Linux-4.19.112+-x86_64-with-debian-9.9
  • Python version: 3.6.6

Notebook

https://www.kaggle.com/chopeen/spacy-with-gpu-support/

@svlandeg svlandeg added feat / cli Feature: Command-line interface feat / serialize Feature: Serialization, saving and loading feat / tok2vec Feature: Token-to-vector layer and pretraining bug Bugs and behaviour differing from documentation labels Apr 9, 2020
@svlandeg (Member)

Thanks for the report! I can replicate this - will look into it.

@svlandeg (Member) commented Apr 10, 2020

Hm, it looks like this is actually the intended behaviour: #3510

When using spacy pretrain, the model is saved only after every epoch. But each epoch can be very big, since pretrain is used for language-modeling tasks, so I added the --n-save-every option to the CLI to also save after every X batches.

Note the difference between an epoch and a batch! What the --n-save-every option does is save ADDITIONAL intermediate temp models every X batches WITHIN an epoch.

The relevant code is this:

    for epoch in range(epoch_start, n_iter + epoch_start):
        for batch_id, batch in ...:
            ...
            # intermediate temp save within the epoch
            if n_save_every and (batch_id % n_save_every == 0):
                _save_model(epoch, is_temp=True)
        # a full model is always saved at the end of every epoch
        _save_model(epoch)

The naming of the option seems confusing though... I think we should add an additional option to support your use case.
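For illustration, a variant of the loop above that supports the behaviour originally requested could gate the end-of-epoch save as well. This is only a sketch: the names `run_training`, `save_model`, and the hypothetical `n_save_epoch` option are assumptions for this example, not spaCy's actual API.

```python
# Sketch of a "save a full model only every N epochs" option for a
# training loop. All names here are illustrative, not spaCy's API.

def run_training(n_epochs, batches_per_epoch, n_save_epoch=None, n_save_every=None):
    saved = []  # record of (epoch, batch_id, is_temp) saves

    def save_model(epoch, batch_id=None, is_temp=False):
        # stand-in for writing modelN.bin / modelN.temp.bin to disk
        saved.append((epoch, batch_id, is_temp))

    for epoch in range(n_epochs):
        for batch_id in range(batches_per_epoch):
            # intermediate temp saves within the epoch (current behaviour)
            if n_save_every and batch_id % n_save_every == 0:
                save_model(epoch, batch_id, is_temp=True)
        # NEW: save a full model only every `n_save_epoch` epochs
        if n_save_epoch is None or epoch % n_save_epoch == 0:
            save_model(epoch)
    return saved
```

With `n_save_epoch=5`, only epochs 0, 5, 10, ... would produce a full model file, which is the disk-space saving the issue asks for.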

@svlandeg svlandeg added enhancement Feature requests and improvements and removed bug Bugs and behaviour differing from documentation labels Apr 10, 2020
@chopeen (Author, Contributor) commented Apr 10, 2020

All clear now, thank you!

I noticed a typo in documentation for preview, so I submitted a PR (#5293).

@svlandeg svlandeg added the help wanted Contributions welcome! label Apr 11, 2020
@svlandeg (Member)

Happy to hear the confusion has been cleared up!

I still think it would be an interesting addition to have an option that does what you were originally looking for, i.e. store a model only every X iterations. If you (or anyone else) feel like contributing a PR, that would be most welcome!

@chopeen (Author, Contributor) commented Apr 14, 2020

Please assign the task to me. I will give it a try over the weekend.

I'm wondering whether an option like --keep-only-when-better wouldn't make more sense: keep a model only each time the loss reaches a new low.
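The --keep-only-when-better idea could be sketched as a small checkpoint policy that saves only when the loss improves on the best seen so far. This is a hypothetical illustration; `BestLossCheckpointer` and its methods are invented names, not part of spaCy.

```python
# Sketch of a "keep only when better" checkpoint policy.
# `BestLossCheckpointer` is an illustrative name, not spaCy's API.

import math

class BestLossCheckpointer:
    """Save a checkpoint only when the loss improves on the best so far."""

    def __init__(self):
        self.best_loss = math.inf
        self.saved_epochs = []

    def maybe_save(self, epoch, loss):
        if loss < self.best_loss:
            self.best_loss = loss
            self.saved_epochs.append(epoch)  # stand-in for writing modelN.bin
            return True
        return False

ckpt = BestLossCheckpointer()
# losses improve at epochs 0, 1, 3, and 5, so only those are saved
for epoch, loss in enumerate([9.0, 7.5, 8.1, 6.9, 6.9, 6.2]):
    ckpt.maybe_save(epoch, loss)
```

Note the strict `<` comparison: a loss that merely ties the best (epoch 4 above) does not trigger a save.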

@svlandeg (Member)

Hey @chopeen, great if you want to give it a go! We don't really officially assign tasks to anyone, but nobody else is working on it right now, so you can definitely give it a shot!

@svlandeg (Member)

I agree that that would be a useful option: it would save disk space. You may still get a lot of models at the beginning of training though, because the loss usually keeps dropping consistently for at least the first few dozen iterations. But you could give it a try and see how it works out.

@svlandeg (Member)

@chopeen: I don't know whether you've had a chance to look into this yet, but issue #3584 and the comment here are relevant: it's probably indeed a good idea to only save the best models.

@chopeen (Author, Contributor) commented May 12, 2020

@svlandeg Keeping the N best models is definitely a better idea than blindly saving every n-th model.

I reviewed the code a few weeks ago to see where to implement the change, but then I got swamped at work. Until the lockdown is over, this idea will need to sit on the back burner.
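The "keep the N best models" policy discussed above can be sketched with a small max-heap keyed on loss, so the worst of the kept checkpoints is always the one evicted. Again, `TopNCheckpoints` and its methods are invented names for illustration, not spaCy's API.

```python
# Sketch of keeping only the N best checkpoints by loss.
# heapq is a min-heap, so we store (-loss, epoch): the heap root is then
# the WORST (highest-loss) checkpoint currently kept.

import heapq

class TopNCheckpoints:
    def __init__(self, n):
        self.n = n
        self._heap = []  # entries: (-loss, epoch)

    def offer(self, epoch, loss):
        """Keep this checkpoint if it is among the N best seen so far.

        Returns the epoch whose model file should be deleted, or None.
        """
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (-loss, epoch))
            return None
        worst_neg_loss, worst_epoch = self._heap[0]
        if loss < -worst_neg_loss:
            # new checkpoint is better: evict the current worst
            heapq.heapreplace(self._heap, (-loss, epoch))
            return worst_epoch  # caller would delete model{worst_epoch}.bin
        return epoch  # not good enough; discard the new checkpoint

    def kept(self):
        return sorted(epoch for _, epoch in self._heap)

top = TopNCheckpoints(3)
for epoch, loss in enumerate([9.0, 7.5, 8.1, 6.9, 7.0, 6.2]):
    top.offer(epoch, loss)
```

Each `offer` is O(log N), and the returned epoch tells the caller which old model file to remove, keeping disk usage bounded at N checkpoints.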

@svlandeg (Member)

That's OK, it would be a nice-to-have feature but I don't think it's urgent ;-)

@svlandeg (Member) commented Aug 20, 2020

This will be fixed from spaCy v3 onwards, which will only save one best and one final model.

[UPDATE]: The above comment wasn't entirely correct. One best, final model is saved for normal training, not for pretraining. However, this PR should address the original issue discussed in this thread.

@github-actions (bot)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2021