An error while training the dependency parser #4402

Closed
ryszardtuora opened this issue Oct 8, 2019 · 4 comments · Fixed by #4516
Labels
bug (Bugs and behaviour differing from documentation) · feat / cli (Feature: Command-line interface) · training (Training and updating models)

Comments

@ryszardtuora

Ever since the 2.2 update I've been having trouble training parser models. Training taggers and NER models works fine, but with parsers I get an error:

Training pipeline: ['parser']
Starting with blank model 'pl'
Loading vector from model 'vocab_kgr_100_handpruned22'
Counting training words (limit=0)

Itn  Dep Loss    UAS     LAS    Token %  CPU WPS
---  ---------  ------  ------  -------  -------
✔ Saved model to output directory                                                                                                             
base_parser_22/model-final
⠴ Creating best model...
Traceback (most recent call last):
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 356, in train
    for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/util.py", line 569, in minibatch_by_words
    doc, gold = next(items)
  File "gold.pyx", line 222, in train_docs
  File "gold.pyx", line 240, in iter_gold_docs
  File "gold.pyx", line 258, in spacy.gold.GoldCorpus._make_docs
ValueError: need more than 0 values to unpack

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/rtuora/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/rtuora/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 486, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 554, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

How to reproduce the behaviour

I've converted the data with the spaCy 2.2 converter. I run the train command with this input:
nice -n 19 python3 -m spacy train pl base_parser_22 LFG22/pl_lfg-ud-train.json LFG22/pl_lfg-ud-dev.json --pipeline parser --vectors vocab_kgr_100_handpruned22 --n-iter 50 --n-early-stopping 10 -G
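
(For reference, the conversion step was roughly the following; the .conllu file names are assumed from the JSON paths above, so the actual paths may differ:)

python3 -m spacy convert pl_lfg-ud-train.conllu LFG22/ --converter conllu
python3 -m spacy convert pl_lfg-ud-dev.conllu LFG22/ --converter conllu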

I've tried this with other treebanks and the error persists. I do not recall encountering anything like it in the previous version.

Your Environment

  • spaCy version: 2.2.1
  • Platform: Linux-4.15.0-58-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.8
@honnibal
Member

honnibal commented Oct 8, 2019

Thanks for the report, I've hit this as well.

See #4392. We're still working on the fix, because my patch in that PR currently breaks the textcat CLI functionality added in v2.2.

@ryszardtuora
Author

Oh, thank you. I've been using the -G option because of a substantial number of non-projective trees in my corpus, but the training does work without this parameter, whereas I remember it malfunctioning earlier and suggesting the use of --gold-preproc. Correct me if I'm wrong, but now that I read about this option, it seems to have less to do with projectivity than with keeping the original tokenization. This is also relevant because I've been experimenting with the idea of including a third-party tokenizer, which does a better job than the existing ruleset for Polish. Is there some built-in support for training models on data tokenized in this way?

@svlandeg added the bug, feat / cli, and training labels on Oct 8, 2019
@adrianeboyd
Contributor

-G keeps the original tokenization rather than trying to align spaCy's rule-based tokenization with the provided tokenization. A model trained with -G would not work as well with the built-in tokenization, so you'd probably want to provide the external tokenization when building a Doc and call the other spaCy pipeline components a little differently.
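
For instance, here is a minimal sketch of that approach with the spaCy 2.x API (the model path and token list below are placeholders, not from this issue):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("base_parser_22/model-final")  # placeholder model path

# Tokens produced by the external tokenizer (placeholder example)
words = ["To", "jest", "zdanie", "."]

# Build the Doc directly from the external tokenization,
# bypassing nlp's own tokenizer
doc = Doc(nlp.vocab, words=words)

# Apply the remaining pipeline components (here, the parser) to the Doc
for name, proc in nlp.pipeline:
    doc = proc(doc)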

If you want to build the other tokenizer into the spaCy pipeline so you can just call nlp() as before, you would need to modify Polish to define a tokenizer that loads and calls the third-party tokenizer. You can look at some of the languages that require more complicated word segmentation and rely on external libraries, such as Japanese (ja), for an example of how this can be done. A sketch of this approach follows.
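
A minimal sketch, assuming a hypothetical third-party tokenizer that returns a list of token strings (text.split() stands in for it here so the example runs on its own):

import spacy
from spacy.tokens import Doc

class ThirdPartyTokenizer(object):
    """Wraps an external tokenizer so it can replace spaCy's own."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Replace this with the call into the third-party tokenizer;
        # it just needs to produce a list of token strings.
        words = text.split()
        return Doc(self.vocab, words=words)

nlp = spacy.blank("pl")
nlp.tokenizer = ThirdPartyTokenizer(nlp.vocab)
doc = nlp("Przykładowe zdanie .")  # now tokenized by the wrapper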

If you think it wouldn't be too complicated to improve spacy's tokenizer, we would be happy to get contributions in this area! It looks like the Polish tokenizer has a lot of exceptions for abbreviations but otherwise isn't very different from English at this point in terms of other punctuation.

@lock

lock bot commented Nov 26, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Nov 26, 2019