Error when using `spacy.gold.docs_to_json` as input to `spacy train` #3813

ophiry · 2019-06-03T10:25:17Z

I'm trying to train a custom NER model using data generated with the `spacy.gold.docs_to_json` function.
When I run the training, I get the following error:

corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
File "gold.pyx", line 112, in spacy.gold.GoldCorpus.__init__
File "gold.pyx", line 125, in spacy.gold.GoldCorpus.write_msgpack
KeyError: 1

Code to reproduce:

import subprocess
import jsonlines
import spacy


nlp = spacy.load('en_core_web_lg')
labels = ['OBJECT', 'COLOR']
data = [
    {'text': 'the sky is blue', 'patterns': [['sky', 'OBJECT'],['blue', 'COLOR']]},
    {'text': 'oranges  are orange', 'patterns': [['oranges', 'OBJECT'],['orange', 'COLOR']]}
]

# this should add the new labels to the base model used for training
ner = nlp.get_pipe("ner")
for label in labels:
    ner.add_label(label)
nlp.to_disk('/tmp/ner_base')


with jsonlines.open('/tmp/train_set.jsonl', 'w') as fh:
    for record in data:
        patterns = [{'label': label, 'pattern': [{'lower': pattern.lower()}]} for pattern, label in record['patterns'] ]
        er = spacy.pipeline.EntityRuler(nlp, patterns=patterns)
        doc = er(nlp(record['text']))
        fh.write(spacy.gold.docs_to_json(doc))

subprocess.run('python -m spacy train en /tmp/ner_out /tmp/train_set.jsonl /tmp/train_set.jsonl -b en_core_web_lg -p ner -VV', shell=True)

Your Environment

Info about spaCy

spaCy version: 2.1.4
Platform: Darwin-18.6.0-x86_64-i386-64bit
Python version: 3.7.3


ines · 2019-06-03T11:19:02Z

Sorry that the training format stuff is still a bit messy – we're working on that (see #2928).

I think the problem here might be that you're writing the result to a JSONL file instead of a JSON file in spaCy's format? If I import json, use a .json file and change your conversion logic to the following, it works as expected for me:

import json

# Collect one converted record per document, each with its own id
result = []
for i, record in enumerate(data):
    patterns = [{'label': label, 'pattern': [{'lower': pattern.lower()}]} for pattern, label in record['patterns'] ]
    er = spacy.pipeline.EntityRuler(nlp, patterns=patterns)
    doc = er(nlp(record['text']))
    result.append(spacy.gold.docs_to_json(doc, id=i))

# Write all records as a single JSON array to a .json file
with open('/tmp/train_set.json', 'w') as fh:
    fh.write(json.dumps(result))
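
To make the difference between the two file layouts concrete, here is a minimal sketch using only the standard library. The `records` list is a hypothetical placeholder for the dicts returned by `docs_to_json` (the real output contains more than this), and the temporary file paths are just for illustration.

```python
import json
import os
import tempfile

# Placeholder records standing in for docs_to_json output (assumption:
# the real dicts carry much more data; only the layout matters here).
records = [{"id": 0, "paragraphs": []}, {"id": 1, "paragraphs": []}]

tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, "train_set.json")
jsonl_path = os.path.join(tmp_dir, "train_set.jsonl")

# JSON: the whole file is a single array holding every record.
with open(json_path, "w") as fh:
    json.dump(records, fh)

# JSONL: one standalone JSON object per line, with no enclosing array.
with open(jsonl_path, "w") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")

# Reading both back shows the structural difference.
with open(json_path) as fh:
    loaded = json.load(fh)          # a list of records
with open(jsonl_path) as fh:
    jsonl_lines = fh.read().splitlines()  # one serialized record per line
```

The first layout is the one used in the snippet above; the second is what the original reproduction script wrote.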

ophiry · 2019-06-03T13:55:12Z

Thanks, this appears to work.
Does the ID also need to be distinct across files (when using a directory of training files)?

ines · 2019-06-03T16:26:15Z

> does the id need to be distinct also across files? (when using a directory of train files)

Yes, within the same training run, the IDs should be unique. They don't have to be sequential and can be pretty much anything – so you could also use a hash of the text or something like that.
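
As a rough sketch of the hash idea, the helper below derives an ID from the text itself, so the same ID comes out no matter which file a document ends up in. The name `text_to_id` is hypothetical, and returning an integer is an assumption made only to stay close to the integer IDs in the earlier snippet.

```python
import hashlib


def text_to_id(text):
    """Derive a stable integer ID from a document's text (hypothetical helper)."""
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    # Keep the first 8 hex digits so the ID fits in 32 bits.
    return int(digest[:8], 16)


texts = ["the sky is blue", "oranges  are orange"]
ids = [text_to_id(text) for text in texts]
```

Each value could then be passed as the `id` argument to `docs_to_json` in place of the loop counter. Note that two documents with identical text would get the same ID, so duplicates would need to be removed or disambiguated first.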

lock · 2019-07-03T18:27:10Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the training (Training and updating models) and usage (General spaCy usage) labels Jun 3, 2019

ines closed this as completed Jun 3, 2019

lock bot locked as resolved and limited conversation to collaborators Jul 3, 2019
