
CLI spacy train fails with large amount of data #4823

Closed
kormilitzin opened this issue Dec 19, 2019 · 4 comments · Fixed by #4827
Labels: bug (Bugs and behaviour differing from documentation), feat / cli (Feature: Command-line interface), training (Training and updating models)

Comments

@kormilitzin

I am training an NER model with 7 categories, and the data set contains 200K examples (texts) with an average of 60K annotated spans per category. However, spacy train fails if I use all of the data. When I randomly subsample, it works normally. The error I receive when using all the data:

$ python -m spacy train en ....

Training pipeline: ['ner']
Starting with blank model 'en'
Counting training words (limit=0)
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/spacy/__main__.py", line 33, in <module>
plac.call(commands[command], sys.argv[1:])
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/mnt/sdf/andrey_work/spacy/lib/python3.6/site-packages/spacy/cli/train.py", line 230, in train
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
File "gold.pyx", line 224, in spacy.gold.GoldCorpus.__init__
File "gold.pyx", line 235, in spacy.gold.GoldCorpus.write_msgpack
File "gold.pyx", line 280, in read_tuples
File "gold.pyx", line 545, in read_json_file
File "gold.pyx", line 592, in _json_iterate
OverflowError: value too large to convert to int

Is there any way to overcome this problem? Thanks.

@svlandeg svlandeg added the bug (Bugs and behaviour differing from documentation) and training (Training and updating models) labels Dec 19, 2019
@svlandeg
Member

svlandeg commented Dec 20, 2019

That does look like spaCy is crashing on the large training file. Could you provide a little more information to help us look into this:

  • the exact command you ran
  • which spaCy version you're using (from source, or installed with pip/conda? which version number?)
  • how large (in MB or GB) your training file is on disk

@svlandeg svlandeg added the feat / cli Feature: Command-line interface label Dec 20, 2019
@adrianeboyd
Contributor

This is a duplicate of #4703. I guess we should add a useful warning, and there's really no reason not to change it to long.
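For context on the fix being discussed: a C `int` is 32 bits on common platforms, so byte offsets past 2**31 - 1 (about 2.1 GB) no longer fit, which matches the OverflowError in the traceback. A minimal illustration of the wraparound (the 3 GB offset is a made-up example, not from spaCy):

```python
import ctypes

# On common platforms a C `int` is 4 bytes, so the largest value it
# can hold is 2**31 - 1 (about 2.1 GB when used as a byte offset).
offset = 3 * 1024 ** 3        # hypothetical offset into a 3 GB file
assert offset > 2 ** 31 - 1   # does not fit in a 32-bit signed int

# ctypes silently wraps out-of-range values, making the overflow visible:
wrapped = ctypes.c_int(2 ** 31).value
print(wrapped)  # -2147483648
```

A C `long` (or `long long`) covers file offsets of this size, which is why widening the type resolves the crash.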

@ines
Member

ines commented Dec 21, 2019

Merging this with #4703!

@lock

lock bot commented Jan 24, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 24, 2020
4 participants