Add `--latest` option in `spacy train` to save latest n epochs of CLI training #3586

bharatr21 · 2019-04-12T11:27:34Z

Close #3584 by adding an optional parameter --latest

Description

The --latest parameter keeps only the latest n epochs saved during training. This could be more useful than keeping every epoch, since the model improves with training. Passing 0 or a negative integer saves all epochs, which preserves the current default behaviour.

All tests pass except for this warning:

spaCy/bin/ud/ud_train.py:45
  /home/user/spaCy/bin/ud/ud_train.py:45: DeprecationWarning: invalid escape sequence \s
    space_re = re.compile("\s+")
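
For reference, the warning appears because `"\s+"` is an ordinary string literal, and `\s` is not a valid Python string escape. A raw string would likely silence it. This is only a minimal sketch: the variable name is taken from the warning above, while the sample text is made up for illustration.

```python
import re

# Raw string: the backslash reaches the regex engine unchanged,
# so Python no longer warns about an invalid escape sequence.
space_re = re.compile(r"\s+")

# Hypothetical sample input, only for illustration.
collapsed = space_re.sub(" ", "spaCy   train\t--n-iter\n30")
print(collapsed)
```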

Also see #3510 for a variation on this idea (saving the model after every n epochs) in spacy pretrain.

Types of change

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.


honnibal · 2019-04-15T10:07:45Z

@Bharat123rox I definitely agree it's annoying the way the disk saving works at the moment. I wonder whether this is the best solution though.

Perhaps it would be better to retain the best model for each component? We can always write out the latest one, and then if the score is better for the parser, we retain the parser model, if the score is better for the tagger, we retain the tagger model, etc. This would prevent the disk from filling up, while being a bit more satisfying than only using the latest iterations, which might not be the best ones.
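
The idea above could be sketched roughly as follows. This is not spaCy's implementation: the function name `retain_best_components`, the directory names `model-latest` and `model-best`, the `weights.txt` marker file, and the scores are all hypothetical and chosen only to illustrate keeping the best-scoring copy of each component while overwriting the latest one.

```python
import shutil
import tempfile
from pathlib import Path


def retain_best_components(latest_dir, best_dir, scores, best_scores):
    """Copy each component whose score improved from latest_dir into best_dir.

    scores and best_scores map component name -> score (higher is better).
    Returns the names of the components that were updated.
    """
    updated = []
    for name, score in scores.items():
        if score > best_scores.get(name, float("-inf")):
            src = Path(latest_dir) / name
            dst = Path(best_dir) / name
            if dst.exists():
                shutil.rmtree(str(dst))
            shutil.copytree(str(src), str(dst))
            best_scores[name] = score
            updated.append(name)
    return updated


with tempfile.TemporaryDirectory() as tmp:
    latest = Path(tmp) / "model-latest"  # hypothetical directory name
    best = Path(tmp) / "model-best"      # hypothetical directory name
    best.mkdir()
    best_scores = {}
    # Made-up per-epoch scores: the tagger peaks at epoch 0, the parser at epoch 1.
    history = [{"tagger": 90.0, "parser": 80.0}, {"tagger": 89.0, "parser": 82.0}]
    for epoch, scores in enumerate(history):
        # Stand-in for writing the latest model: one marker file per component.
        for name in scores:
            (latest / name).mkdir(parents=True, exist_ok=True)
            (latest / name / "weights.txt").write_text("epoch %d" % epoch)
        retain_best_components(latest, best, scores, best_scores)
    kept = {p.name: (p / "weights.txt").read_text() for p in sorted(best.iterdir())}

print(kept)
```

With this approach the disk holds at most two copies of each component (latest and best), regardless of the number of iterations.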

bharatr21 · 2019-04-17T04:48:14Z

@honnibal Can I implement this next week? I will be away from work for the rest of this week. I'm also still struggling to find how to extract the metrics for each epoch, but I will figure it out soon.

ines

Looks good!

I propose one small change: I think it'd make sense to call this --n-latest instead of --latest? This makes it consistent with --n-iter etc. and it's immediately clear that the expected value is a number.

ines · 2019-04-17T09:24:15Z

website/docs/api/cli.md

@@ -198,7 +198,7 @@ will only train the tagger and parser.

 ```bash
 $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
-[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-examples] [--use-gpu]
+[--base-model] [--pipeline] [--vectors] [--n-iter] [--latest] [--n-examples] [--use-gpu]


Suggested change

[--base-model] [--pipeline] [--vectors] [--n-iter] [--latest] [--n-examples] [--use-gpu]

[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-latest] [--n-examples] [--use-gpu]

ines · 2019-04-17T09:24:28Z

website/docs/api/cli.md

@@ -213,7 +213,8 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
 | `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
 | `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
 | `--vectors`, `-v` | option | Model to load vectors from. |
-| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
+| `--n-iter`, `-n` | option | Number of iterations (default: `30`).|
+| `--latest`, `-l` | option | Number of epochs to save from the latest epoch (`0` or a negative integer to save all epochs, default: `0`).|


Suggested change

| `--latest`, `-l` | option | Number of epochs to save from the latest epoch (`0` or a negative integer to save all epochs, default: `0`).|

| `--n-latest`, `-l` | option | Number of epochs to save from the latest epoch (`0` or a negative integer to save all epochs, default: `0`).|


ines · 2019-04-17T09:24:42Z

spacy/cli/train.py

@@ -36,6 +36,7 @@
 vectors=("Model to load vectors from", "option", "v", str),
 n_iter=("Number of iterations", "option", "n", int),
 n_examples=("Number of examples", "option", "ns", int),
+ latest=("Number of epochs to save from the latest epoch", "option", "l", int),


Suggested change

latest=("Number of epochs to save from the latest epoch", "option", "l", int),

n_latest=("Number of epochs to save from the latest epoch", "option", "l", int),

ines · 2019-04-17T09:24:51Z

spacy/cli/train.py

@@ -74,6 +75,7 @@ def train(
 pipeline="tagger,parser,ner",
 vectors=None,
 n_iter=30,
+ latest=0,


Suggested change

latest=0,

n_latest=0,

ines · 2019-04-17T09:25:04Z

spacy/cli/train.py

@@ -328,6 +330,10 @@ def train(
 gpu_wps=gpu_wps,
 )
 msg.row(progress, **row_settings)
+
+ if latest > 0 and i >= latest:


Suggested change

if latest > 0 and i >= latest:

if n_latest > 0 and i >= n_latest:

ines · 2019-04-17T09:26:04Z

spacy/cli/train.py

@@ -328,6 +330,10 @@ def train(
 gpu_wps=gpu_wps,
 )
 msg.row(progress, **row_settings)
+
+ if latest > 0 and i >= latest:
+ shutil.rmtree(output_path / ("model%d" % (i - latest)))


Suggested change

shutil.rmtree(output_path / ("model%d" % (i - latest)))

shutil.rmtree(output_path / ("model%d" % (i - n_latest)))

Also, if I remember correctly, shutil.rmtree takes a string, not a path? I think I fixed some issue recently where this caused a warning. We have a path2str helper in spacy.compat that you can use for this.
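
A minimal, standalone sketch of the pruning logic with the path converted to a string before calling `shutil.rmtree`. The helper name `prune_old_model` is hypothetical, `str()` is used here as a stand-in for the `path2str` helper mentioned above, and the empty `model%d` directories only simulate saved epochs.

```python
import shutil
import tempfile
from pathlib import Path


def prune_old_model(output_path, i, n_latest):
    """Delete the epoch directory that falls outside the last n_latest epochs.

    n_latest <= 0 means "keep everything", matching the proposed default.
    """
    if n_latest > 0 and i >= n_latest:
        stale = Path(output_path) / ("model%d" % (i - n_latest))
        if stale.exists():
            # str() stands in for spacy.compat.path2str (assumption).
            shutil.rmtree(str(stale))


with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp)
    for i in range(5):
        # Stand-in for saving the model of epoch i to disk.
        (output_path / ("model%d" % i)).mkdir()
        prune_old_model(output_path, i, n_latest=2)
    remaining = sorted(p.name for p in output_path.iterdir())

print(remaining)
```

The `exists()` check is an extra guard, not part of the diff above: it avoids an error if a directory was already removed or never written.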

Okay, never mind, I think it's probably best to work around this and not have the --latest argument at all.

bharatr21 added 2 commits April 12, 2019 16:49

Add --latest option in spacy train CLI to save only the latest n …

546d945

…epochs

Fix indentation issue

1a9c10c

ines added the enhancement (Feature requests and improvements), feat / cli (Feature: Command-line interface), and training (Training and updating models) labels Apr 16, 2019

ines previously requested changes Apr 17, 2019


honnibal closed this May 3, 2019

svlandeg mentioned this pull request May 11, 2020

Pretraining with option --n-save-every still saves all models #5280

Closed
