
Pretraining with option --n-save-every still saves all models #5280

Closed
chopeen opened this issue Apr 9, 2020 · 12 comments
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface feat / serialize Feature: Serialization, saving and loading feat / tok2vec Feature: Token-to-vector layer and pretraining help wanted Contributions welcome!

Comments

@chopeen (Contributor) commented Apr 9, 2020

I am running the following command:

!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT \
    --use-vectors --n-save-every 5

To use less disk space, I specified the option --n-save-every, expecting it to save a model only every X batches.

However, all models are still saved, along with additional .temp.bin files:

$ ls -al /kaggle/working/models/tok2vec_rf_sent_sci
[snip]
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model110.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model110.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model111.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model111.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model112.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model112.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model113.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model113.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model114.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model114.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model115.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model115.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model116.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model116.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model117.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model117.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model118.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model118.temp.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model119.bin
-rw-r--r-- 1 root root 3889626 Apr  9 08:49 model119.temp.bin
[snip]

Info about spaCy

  • spaCy version: 2.2.4
  • Platform: Linux-4.19.112+-x86_64-with-debian-9.9
  • Python version: 3.6.6

Notebook

https://www.kaggle.com/chopeen/spacy-with-gpu-support/

@svlandeg svlandeg added feat / cli Feature: Command-line interface feat / serialize Feature: Serialization, saving and loading feat / tok2vec Feature: Token-to-vector layer and pretraining bug Bugs and behaviour differing from documentation labels Apr 9, 2020
@svlandeg (Member)

Thanks for the report! I can replicate this - will look into it.

@svlandeg (Member) commented Apr 10, 2020

Hm, it looks like this is actually the intended behaviour: #3510

When using spacy pretrain, the model is saved only after every epoch. But each epoch can be very big, since pretrain is used for language-modeling tasks, so I added the --n-save-every option to the CLI to also save after every X batches.

Note the difference between an epoch and a batch! What the --n-save-every option does is save ADDITIONAL intermediate temp models every X batches WITHIN an epoch.

The relevant code is this:

    for epoch in range(epoch_start, n_iter + epoch_start):
        for batch_id, batch in ...:
            ...
            # intermediate temp save within the epoch
            if n_save_every and (batch_id % n_save_every == 0):
                _save_model(epoch, is_temp=True)
        # a full model is always saved at the end of every epoch
        _save_model(epoch)

The naming of the option seems confusing though... I think we should add an additional option to support your use case.
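For illustration, a variant of the loop above that supports the behaviour originally requested could gate the end-of-epoch save as well. This is only a sketch: the names `run_training`, `save_model`, and the hypothetical `n_save_epoch` option are assumptions for this example, not spaCy's actual API.

```python
# Sketch of a "save a full model only every N epochs" option for a
# training loop. All names here are illustrative, not spaCy's API.

def run_training(n_epochs, batches_per_epoch, n_save_epoch=None, n_save_every=None):
    saved = []  # record of (epoch, batch_id, is_temp) saves

    def save_model(epoch, batch_id=None, is_temp=False):
        # stand-in for writing modelN.bin / modelN.temp.bin to disk
        saved.append((epoch, batch_id, is_temp))

    for epoch in range(n_epochs):
        for batch_id in range(batches_per_epoch):
            # intermediate temp saves within the epoch (current behaviour)
            if n_save_every and batch_id % n_save_every == 0:
                save_model(epoch, batch_id, is_temp=True)
        # NEW: save a full model only every `n_save_epoch` epochs
        if n_save_epoch is None or epoch % n_save_epoch == 0:
            save_model(epoch)
    return saved
```

With `n_save_epoch=5`, only epochs 0, 5, 10, ... would produce a full model file, which is the disk-space saving the issue asks for.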

@svlandeg svlandeg added enhancement Feature requests and improvements and removed bug Bugs and behaviour differing from documentation labels Apr 10, 2020
@chopeen (Author, Contributor) commented Apr 10, 2020

All clear now, thank you!

I noticed a typo in documentation for preview, so I submitted a PR (#5293).

@svlandeg svlandeg added the help wanted Contributions welcome! label Apr 11, 2020
@svlandeg (Member)

Happy to hear the confusion has been cleared up!

I still think it would be an interesting addition to have an option that does what you were originally looking for, i.e. store a model only every X iterations. If you (or anyone else) feel like contributing a PR, that would be most welcome!

@chopeen (Author, Contributor) commented Apr 14, 2020

Please assign the task to me. I will give it a try over the weekend.

I'm wondering whether an option like --keep-only-when-better wouldn't make more sense: keep a model only each time the loss reaches a new low.
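The --keep-only-when-better idea could be sketched as a small checkpoint policy that saves only when the loss improves on the best seen so far. This is a hypothetical illustration; `BestLossCheckpointer` and its methods are invented names, not part of spaCy.

```python
# Sketch of a "keep only when better" checkpoint policy.
# `BestLossCheckpointer` is an illustrative name, not spaCy's API.

import math

class BestLossCheckpointer:
    """Save a checkpoint only when the loss improves on the best so far."""

    def __init__(self):
        self.best_loss = math.inf
        self.saved_epochs = []

    def maybe_save(self, epoch, loss):
        if loss < self.best_loss:
            self.best_loss = loss
            self.saved_epochs.append(epoch)  # stand-in for writing modelN.bin
            return True
        return False

ckpt = BestLossCheckpointer()
# losses improve at epochs 0, 1, 3, and 5, so only those are saved
for epoch, loss in enumerate([9.0, 7.5, 8.1, 6.9, 6.9, 6.2]):
    ckpt.maybe_save(epoch, loss)
```

Note the strict `<` comparison: a loss that merely ties the best (epoch 4 above) does not trigger a save.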

@svlandeg (Member)

Hey @chopeen, great if you want to give it a go! We don't really officially assign tasks to anyone, but nobody else is working on it right now, so you can definitely give it a shot!

@svlandeg (Member)

I agree that that would be a useful option: it would save disk space. You may still get a lot of models at the beginning of training though, because the loss usually keeps dropping consistently for at least the first few dozen iterations. But you could give it a try and see how it works out.

@svlandeg (Member)

@chopeen: I don't know whether you've had a chance to look into this yet, but issue #3584 and the comment here are relevant: it's probably indeed a good idea to only save the best models.

@chopeen (Author, Contributor) commented May 12, 2020

@svlandeg Keeping the N best models is definitely a better idea than blindly saving every n-th model.

I reviewed the code a few weeks ago to see where to implement the change, but then I got swamped at work. Until the lockdown is over, this idea will need to sit on the back burner.
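The "keep the N best models" policy discussed above can be sketched with a small max-heap keyed on loss, so the worst of the kept checkpoints is always the one evicted. Again, `TopNCheckpoints` and its methods are invented names for illustration, not spaCy's API.

```python
# Sketch of keeping only the N best checkpoints by loss.
# heapq is a min-heap, so we store (-loss, epoch): the heap root is then
# the WORST (highest-loss) checkpoint currently kept.

import heapq

class TopNCheckpoints:
    def __init__(self, n):
        self.n = n
        self._heap = []  # entries: (-loss, epoch)

    def offer(self, epoch, loss):
        """Keep this checkpoint if it is among the N best seen so far.

        Returns the epoch whose model file should be deleted, or None.
        """
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (-loss, epoch))
            return None
        worst_neg_loss, worst_epoch = self._heap[0]
        if loss < -worst_neg_loss:
            # new checkpoint is better: evict the current worst
            heapq.heapreplace(self._heap, (-loss, epoch))
            return worst_epoch  # caller would delete model{worst_epoch}.bin
        return epoch  # not good enough; discard the new checkpoint

    def kept(self):
        return sorted(epoch for _, epoch in self._heap)

top = TopNCheckpoints(3)
for epoch, loss in enumerate([9.0, 7.5, 8.1, 6.9, 7.0, 6.2]):
    top.offer(epoch, loss)
```

Each `offer` is O(log N), and the returned epoch tells the caller which old model file to remove, keeping disk usage bounded at N checkpoints.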

@svlandeg (Member)

That's OK, it would be a nice-to-have feature but I don't think it's urgent ;-)

@svlandeg (Member) commented Aug 20, 2020

This will be fixed from spaCy v3 onwards, which will only save one best and one final model.

[UPDATE]: The above comment wasn't entirely correct. One best, final model is saved for normal training, not for pretraining. However, this PR should address the original issue discussed in this thread.

@github-actions (bot)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2021