Error when using spacy.gold.docs_to_json as input to spacy train #3813

Closed
ophiry opened this issue Jun 3, 2019 · 4 comments
Labels
training (Training and updating models), usage (General spaCy usage)

Comments


ophiry commented Jun 3, 2019

I'm trying to train a custom NER model using data generated with the spacy.gold.docs_to_json function.
When I run the training, I get this error:

corpus = GoldCorpus(train_path, dev_path, limit=n_examples)

File "gold.pyx", line 112, in spacy.gold.GoldCorpus.__init__
File "gold.pyx", line 125, in spacy.gold.GoldCorpus.write_msgpack
KeyError: 1

Code to reproduce:

import subprocess
import jsonlines
import spacy


nlp = spacy.load('en_core_web_lg')
labels = ['OBJECT', 'COLOR']
data = [
    {'text': 'the sky is blue', 'patterns': [['sky', 'OBJECT'],['blue', 'COLOR']]},
    {'text': 'oranges  are orange', 'patterns': [['oranges', 'OBJECT'],['orange', 'COLOR']]}
]

# this should add the new labels to the base model used for training
ner = nlp.get_pipe("ner")
for label in labels:
    ner.add_label(label)
nlp.to_disk('/tmp/ner_base')


with jsonlines.open('/tmp/train_set.jsonl', 'w') as fh:
    for record in data:
        patterns = [{'label': label, 'pattern': [{'lower': pattern.lower()}]} for pattern, label in record['patterns'] ]
        er = spacy.pipeline.EntityRuler(nlp, patterns=patterns)
        doc = er(nlp(record['text']))
        fh.write(spacy.gold.docs_to_json(doc))

subprocess.run('python -m spacy train en /tmp/ner_out /tmp/train_set.jsonl /tmp/train_set.jsonl -b en_core_web_lg -p ner -VV', shell=True)


Your Environment

Info about spaCy

  • spaCy version: 2.1.4
  • Platform: Darwin-18.6.0-x86_64-i386-64bit
  • Python version: 3.7.3
ines added the training and usage labels Jun 3, 2019
Member

ines commented Jun 3, 2019

Sorry that the training format stuff is still a bit messy – we're working on that (see #2928).

I think the problem here might be that you're writing the result to a JSONL file instead of a JSON file in spaCy's format? If I import json, use a .json file and change your conversion logic to the following, it works as expected for me:

import json

result = []
for i, record in enumerate(data):
    patterns = [{'label': label, 'pattern': [{'lower': pattern.lower()}]} for pattern, label in record['patterns'] ]
    er = spacy.pipeline.EntityRuler(nlp, patterns=patterns)
    doc = er(nlp(record['text']))
    result.append(spacy.gold.docs_to_json(doc, id=i))

with open('/tmp/train_set.json', 'w') as fh:
    fh.write(json.dumps(result))
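
To make the JSON vs. JSONL distinction concrete, here is a minimal sketch using only the standard library. The two dicts are stand-ins for what spacy.gold.docs_to_json returns (their shape is an assumption, reduced to "id" and "paragraphs"), and the temporary file paths are illustrative only.

```python
import json
import os
import tempfile

# Stand-ins for the dicts returned by spacy.gold.docs_to_json (assumed shape).
docs = [
    {"id": 0, "paragraphs": []},
    {"id": 1, "paragraphs": []},
]

tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "train_set.json")
jsonl_path = os.path.join(tmp_dir, "train_set.jsonl")

# JSON: the whole file is a single array holding every document.
with open(json_path, "w") as fh:
    json.dump(docs, fh)

# JSONL: one standalone object per line, with no enclosing array.
with open(jsonl_path, "w") as fh:
    for doc in docs:
        fh.write(json.dumps(doc) + "\n")

# The JSON file parses in one call and yields a list of documents.
with open(json_path) as fh:
    loaded_json = json.load(fh)

# The JSONL file has to be parsed line by line.
with open(jsonl_path) as fh:
    loaded_jsonl = [json.loads(line) for line in fh if line.strip()]
```

Both files carry the same records, but only the first is a single JSON document, so a reader expecting one format will not parse the other correctly.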

Author

ophiry commented Jun 3, 2019

Thanks, this appears to work.
Does the id also need to be distinct across files (when using a directory of training files)?

Member

ines commented Jun 3, 2019

> does the id need to be distinct also across files? (when using a directory of train files)

Yes, within the same training run, the IDs should be unique. They don't have to be sequential and can be pretty much anything – so you could also use a hash of the text or something like that.
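
A minimal sketch of the hash-based ID idea, assuming SHA-1 over the UTF-8 text is acceptable. The helper name text_to_id is hypothetical and not part of spaCy.

```python
import hashlib


def text_to_id(text):
    """Derive a stable ID from the document text (hypothetical helper)."""
    return hashlib.sha1(text.encode("utf8")).hexdigest()


texts = ["the sky is blue", "oranges  are orange"]
ids = [text_to_id(text) for text in texts]
```

Each value could then be passed as the id argument in place of the loop counter, assuming docs_to_json accepts a string there, which this sketch does not check. Note that identical texts produce identical IDs, so duplicated sentences would need something extra mixed into the hash (for example the file name) to stay unique across a directory of training files.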

ines closed this as completed Jun 3, 2019

lock bot commented Jul 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jul 3, 2019