An error while training the dependency parser #4402

Closed
ryszardtuora opened this issue Oct 8, 2019 · 4 comments · Fixed by #4516
Labels
bug (Bugs and behaviour differing from documentation) · feat / cli (Feature: Command-line interface) · training (Training and updating models)

Comments

@ryszardtuora

Ever since the 2.2 update I've been having trouble training parser models. Training taggers and NER models works fine, but with parsers I get an error:

Training pipeline: ['parser']
Starting with blank model 'pl'
Loading vector from model 'vocab_kgr_100_handpruned22'
Counting training words (limit=0)

Itn  Dep Loss    UAS     LAS    Token %  CPU WPS
---  ---------  ------  ------  -------  -------
✔ Saved model to output directory                                                                                                             
base_parser_22/model-final
⠴ Creating best model...
Traceback (most recent call last):
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 356, in train
    for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/util.py", line 569, in minibatch_by_words
    doc, gold = next(items)
  File "gold.pyx", line 222, in train_docs
  File "gold.pyx", line 240, in iter_gold_docs
  File "gold.pyx", line 258, in spacy.gold.GoldCorpus._make_docs
ValueError: need more than 0 values to unpack

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/rtuora/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/rtuora/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 486, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/rtuora/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 554, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

How to reproduce the behaviour

I've converted the data with the spaCy 2.2 converter. I run the train command with this input:
nice -n 19 python3 -m spacy train pl base_parser_22 LFG22/pl_lfg-ud-train.json LFG22/pl_lfg-ud-dev.json --pipeline parser --vectors vocab_kgr_100_handpruned22 --n-iter 50 --n-early-stopping 10 -G
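
(For reference, the conversion step was roughly the following; the .conllu file names are assumed from the JSON paths above, so the actual paths may differ:)

python3 -m spacy convert pl_lfg-ud-train.conllu LFG22/ --converter conllu
python3 -m spacy convert pl_lfg-ud-dev.conllu LFG22/ --converter conllu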

I've tried this with other treebanks and the error persists. I do not recall encountering anything like it in the previous version.

Your Environment

  • spaCy version: 2.2.1
  • Platform: Linux-4.15.0-58-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.8
@honnibal
Member

honnibal commented Oct 8, 2019

Thanks for the report, I've hit this as well.

See #4392. We're still working on the fix, because my patch in that PR currently breaks the textcat CLI functionality added in v2.2.

@ryszardtuora
Author

Oh, thank you. I've been using the -G option because of a substantial number of non-projective trees in my corpus, but the training does work without this parameter, whereas I remember it malfunctioning earlier and suggesting the use of --gold-preproc. Correct me if I'm wrong, but now that I read about this option, it seems to have less to do with projectivity than with keeping the original tokenization. This is also relevant because I've been experimenting with the idea of including a third-party tokenizer, which does a better job than the existing ruleset for Polish. Is there some built-in support for training models on data tokenized in this way?

@svlandeg added the bug, feat / cli, and training labels on Oct 8, 2019
@adrianeboyd
Contributor

-G keeps the original tokenization rather than trying to align spaCy's rule-based tokenization with the provided tokenization. A model trained with -G would not work as well with the built-in tokenization, so you'd probably want to provide the external tokenization when building a Doc and call the other spaCy pipeline components a little differently.
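
For instance, here is a minimal sketch of that approach with the spaCy 2.x API (the model path and token list below are placeholders, not from this issue):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("base_parser_22/model-final")  # placeholder model path

# Tokens produced by the external tokenizer (placeholder example)
words = ["To", "jest", "zdanie", "."]

# Build the Doc directly from the external tokenization,
# bypassing nlp's own tokenizer
doc = Doc(nlp.vocab, words=words)

# Apply the remaining pipeline components (here, the parser) to the Doc
for name, proc in nlp.pipeline:
    doc = proc(doc)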

If you want to build the other tokenizer into the spaCy pipeline so you can just call nlp() as before, you would need to modify Polish to define a tokenizer that loads and calls the third-party tokenizer. You can look at some of the languages that require more complicated word segmentation and rely on external libraries, such as Japanese (ja), for an example of how this can be done. A sketch of this approach follows.
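
A minimal sketch, assuming a hypothetical third-party tokenizer that returns a list of token strings (text.split() stands in for it here so the example runs on its own):

import spacy
from spacy.tokens import Doc

class ThirdPartyTokenizer(object):
    """Wraps an external tokenizer so it can replace spaCy's own."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Replace this with the call into the third-party tokenizer;
        # it just needs to produce a list of token strings.
        words = text.split()
        return Doc(self.vocab, words=words)

nlp = spacy.blank("pl")
nlp.tokenizer = ThirdPartyTokenizer(nlp.vocab)
doc = nlp("Przykładowe zdanie .")  # now tokenized by the wrapper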

If you think it wouldn't be too complicated to improve spacy's tokenizer, we would be happy to get contributions in this area! It looks like the Polish tokenizer has a lot of exceptions for abbreviations but otherwise isn't very different from English at this point in terms of other punctuation.

@lock

lock bot commented Nov 26, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Nov 26, 2019