Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --latest option in spacy train to save latest n epochs of CLI training #3586

Closed
wants to merge 2 commits into from

Conversation

bharatr21
Copy link
Contributor

@bharatr21 bharatr21 commented Apr 12, 2019

Close #3584 by adding an optional parameter --latest

Description

This parameter --latest saves only the latest n epochs trained by the model. This could be more beneficial since the model improves with training. 0 or a negative integer enables saving all epochs, preserving the defaults.

All tests pass except for this warning:

spaCy/bin/ud/ud_train.py:45
  /home/user/spaCy/bin/ud/ud_train.py:45: DeprecationWarning: invalid escape sequence \s
    space_re = re.compile("\s+")

Also refer #3510 for a modification of this idea (save model after every n epochs) in spacy pretrain

Types of change

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal
Copy link
Member

@Bharat123rox I definitely agree it's annoying the way the disk saving works at the moment. I wonder whether this is the best solution though.

Perhaps it would be better to retain the best model for each component? We can always write out the latest one, and then if the score is better for the parser, we retain the parser model, if the score is better for the tagger, we retain the tagger model, etc. This would prevent the disk from filling up, while being a bit more satisfying than only using the latest iterations, which might not be the best ones.

@ines ines added enhancement Feature requests and improvements feat / cli Feature: Command-line interface training Training and updating models labels Apr 16, 2019
@bharatr21
Copy link
Contributor Author

@honnibal Can I implement this next week as I will be away from work for the rest of this week (and also, I'm still struggling to find how to extract the metrics for each epoch at the moment, but I will figure it out soon)

ines
ines previously requested changes Apr 17, 2019
Copy link
Member

@ines ines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I propose one small change: I think it'd make sense to call this --n-latest instead of --latest? This makes it consistent with --n-iter etc. and it's immediately clear that the expected value is a number.

@@ -198,7 +198,7 @@ will only train the tagger and parser.

```bash
$ python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-examples] [--use-gpu]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--latest] [--n-examples] [--use-gpu]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[--base-model] [--pipeline] [--vectors] [--n-iter] [--latest] [--n-examples] [--use-gpu]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-latest] [--n-examples] [--use-gpu]

@@ -213,7 +213,8 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`).|
| `--latest`, `-l` | option | Number of epochs to save from the latest epoch (`0` or a negative integer to save all epochs, default: `0`).|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `--latest`, `-l` | option | Number of epochs to save from the latest epoch (`0` or a negative integer to save all epochs, default: `0`).|
| `--n-latest`, `-l` | option | Number of epochs to save from the latest epoch (`0` or a negative integer to save all epochs, default: `0`).|

@@ -36,6 +36,7 @@
vectors=("Model to load vectors from", "option", "v", str),
n_iter=("Number of iterations", "option", "n", int),
n_examples=("Number of examples", "option", "ns", int),
latest=("Number of epochs to save from the latest epoch", "option", "l", int),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
latest=("Number of epochs to save from the latest epoch", "option", "l", int),
n_latest=("Number of epochs to save from the latest epoch", "option", "l", int),

@@ -74,6 +75,7 @@ def train(
pipeline="tagger,parser,ner",
vectors=None,
n_iter=30,
latest=0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
latest=0,
n_latest=0,

@@ -328,6 +330,10 @@ def train(
gpu_wps=gpu_wps,
)
msg.row(progress, **row_settings)

if latest > 0 and i >= latest:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if latest > 0 and i >= latest:
if n_latest > 0 and i >= n_latest:

@@ -328,6 +330,10 @@ def train(
gpu_wps=gpu_wps,
)
msg.row(progress, **row_settings)

if latest > 0 and i >= latest:
shutil.rmtree(output_path / ("model%d" % (i - latest)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
shutil.rmtree(output_path / ("model%d" % (i - latest)))
shutil.rmtree(output_path / ("model%d" % (i - n_latest)))

Also, if I remember correctly, shutil.rmtree takes a string, not a path? I think I fixed some issue recently where this caused a warning. We have a path2str helper in spacy.compat that you can use for this.

@ines ines dismissed their stale review April 17, 2019 09:32

Okay, nevermind, I think it's probably best to work around this and not have the --latest argument at all.

@honnibal honnibal closed this May 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface training Training and updating models
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a parameter for saving only the n latest epochs while using CLI Training
3 participants