
CLI spacy train fails with large amount of data #4823

Closed
kormilitzin opened this issue Dec 19, 2019 · 4 comments · Fixed by #4827
Labels: bug (Bugs and behaviour differing from documentation), feat / cli (Feature: Command-line interface), training (Training and updating models)

Comments

@kormilitzin

I am training an NER model with 7 categories, and the data set contains 200K examples (texts) with an average of 60K annotated spans per category. However, spacy train fails if I use all of the data. When I randomly subsample, it works normally. The error I receive when using all the data:

$ python -m spacy train en ....

Training pipeline: ['ner']
Starting with blank model 'en'
Counting training words (limit=0)
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/spacy/__main__.py", line 33, in <module>
plac.call(commands[command], sys.argv[1:])
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/spacy/cli/train.py", line 230, in train
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
File "gold.pyx", line 224, in spacy.gold.GoldCorpus.__init__
File "gold.pyx", line 235, in spacy.gold.GoldCorpus.write_msgpack
File "gold.pyx", line 280, in read_tuples
File "gold.pyx", line 545, in read_json_file
File "gold.pyx", line 592, in _json_iterate
OverflowError: value too large to convert to int

Is there any way to overcome this problem? Thanks.

@svlandeg svlandeg added the bug (Bugs and behaviour differing from documentation) and training (Training and updating models) labels Dec 19, 2019
@svlandeg
Member

svlandeg commented Dec 20, 2019

That does look like spaCy is crashing on the large training file. Could you provide a little more information to help us look into this:

  • the exact command you ran
  • which spaCy version you're using (from source, or installed with pip/conda? which version number?)
  • how large (in MB or GB) your training file is on disk

@svlandeg svlandeg added the feat / cli Feature: Command-line interface label Dec 20, 2019
@adrianeboyd
Contributor

This is a duplicate of #4703. I guess we should add a useful warning, and there's really no reason not to change it to long.
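For context on the fix being discussed: a C `int` is 32 bits on common platforms, so byte offsets past 2**31 - 1 (about 2.1 GB) no longer fit, which matches the OverflowError in the traceback. A minimal illustration of the wraparound (the 3 GB offset is a made-up example, not from spaCy):

```python
import ctypes

# On common platforms a C `int` is 4 bytes, so the largest value it
# can hold is 2**31 - 1 (about 2.1 GB when used as a byte offset).
offset = 3 * 1024 ** 3        # hypothetical offset into a 3 GB file
assert offset > 2 ** 31 - 1   # does not fit in a 32-bit signed int

# ctypes silently wraps out-of-range values, making the overflow visible:
wrapped = ctypes.c_int(2 ** 31).value
print(wrapped)  # -2147483648
```

A C `long` (or `long long`) covers file offsets of this size, which is why widening the type resolves the crash.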

@ines
Member

ines commented Dec 21, 2019

Merging this with #4703!

@lock

lock bot commented Jan 24, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 24, 2020
4 participants