No valid 'lang' when creating blank model with vocab from file #4054

Closed
svlandeg opened this issue Jul 31, 2019 · 3 comments
Labels: bug (Bugs and behaviour differing from documentation), feat / serialize (Feature: Serialization, saving and loading), usage (General spaCy usage)

Comments

@svlandeg
Member

How to reproduce the behaviour

Previously, I saved en_core_web_lg to disk, which created a vocab subdirectory and a meta.json that reads (among other things)

"lang":"en",
"name":"core_web_lg"

Now, I'm attempting to use the vocab subdir as source for a blank model:

import spacy
from spacy.vocab import Vocab

# vocab_dir is the vocab subdirectory of the saved en_core_web_lg model
vocab = Vocab().from_disk(vocab_dir)
nlp = spacy.blank("en", vocab=vocab)
print(nlp("This is a test sentence"))
nlp.to_disk(output_dir)
nlp2 = spacy.load(output_dir)
print(nlp2("This is another test sentence"))

This fails with the error ValueError: [E054] No valid 'lang' setting found in model meta.json.

And indeed, the meta.json of the new nlp object reads

"lang":""

Is this expected behaviour? When I run this in a unit test and replace the spacy.blank line with

nlp = spacy.blank("en", vocab=en_vocab)

it does work correctly.

Your Environment

  • spaCy version: 2.1.6
  • Platform: Windows-10-10.0.17763-SP0
  • Python version: 3.6.7
ines added the feat / serialize label Jul 31, 2019
@ines
Member

ines commented Jul 31, 2019

I think the problem might be related to this:

spaCy/spacy/language.py

Lines 175 to 176 in 23ec07d

    def meta(self):
        self._meta.setdefault("lang", self.vocab.lang)

When you call Vocab().from_disk, it loads in the data, but it does that all on top of a blank Vocab. The language-specific Vocab is created via the create_vocab classmethod of the given Language defaults. The blank English model has that, but it's then overwritten by the vocab you pass in.
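The mechanism described above can be illustrated with a small pure-Python sketch. ToyVocab and ToyLanguage are hypothetical stand-ins written for this illustration, not spaCy classes; they only mimic the setdefault pattern from the quoted lines and the assumption that a vocab built without language defaults reports an empty lang.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocab; lang stays empty unless set at creation."""

    def __init__(self, lang=""):
        self.lang = lang

    def from_disk(self, path):
        # Assumption for this sketch: loading adds data on top of the existing
        # object and does not set a language. Nothing is read from `path` here.
        return self


class ToyLanguage:
    """Hypothetical stand-in for a language class such as English."""

    lang = "en"

    def __init__(self, vocab=None):
        # Without an explicit vocab, build a language-specific one.
        # A vocab passed in by the caller replaces it entirely.
        self.vocab = vocab if vocab is not None else ToyVocab(lang=self.lang)
        self._meta = {}

    @property
    def meta(self):
        # Same pattern as the quoted lines: "lang" is taken from the vocab.
        self._meta.setdefault("lang", self.vocab.lang)
        return self._meta


default_nlp = ToyLanguage()
custom_nlp = ToyLanguage(vocab=ToyVocab().from_disk("path/to/vocab"))

print(repr(default_nlp.meta["lang"]))  # 'en'
print(repr(custom_nlp.meta["lang"]))   # ''
```

In this toy model the language-specific vocab carries "en" into the meta, while the vocab loaded on top of a blank object leaves "lang" empty, which matches the meta.json reported in the issue.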

I do agree that this is slightly confusing, and I'm not immediately sure what the solution should be. In general, for a use case like this, the recommended best practice would probably be to remove all the pipeline components you don't want, which will give you the desired result: a base model with the vocab.
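A minimal sketch of that suggestion, assuming the object passed in exposes pipe_names and remove_pipe as spaCy's Language class does. The helper name strip_pipeline is made up for this example, and whether the resulting model round-trips through to_disk and spacy.load has not been verified here.

```python
def strip_pipeline(nlp):
    """Remove every pipeline component, leaving the vocab and language untouched."""
    # Iterate over a copy, since the list of names shrinks as pipes are removed.
    for name in list(nlp.pipe_names):
        nlp.remove_pipe(name)
    return nlp


# Usage, assuming en_core_web_lg is installed and output_dir is defined:
#   import spacy
#   nlp = strip_pipeline(spacy.load("en_core_web_lg"))
#   nlp.to_disk(output_dir)
#   nlp2 = spacy.load(output_dir)
```

Because the vocab is never swapped out in this approach, it keeps whatever language setting the original model had.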

@svlandeg
Member Author

Ok, that makes sense. Thanks for the detailed explanation :-)

I was testing this in the context of serializing the KnowledgeBase for the entity linking functionality. To read the KB back in, you need at least the original vocab object, or otherwise the corresponding nlp object you used to create the KB with.

Because you have the option of loading the KB back in with the vocab only, you can then also create a blank English model with this vocab, add a new entity_linking pipe, and have a functional pipeline, at least for demonstration purposes (in practice you'd need ner too, of course).

See here: https:/svlandeg/spaCy/blob/feature/el-docs/examples/training/train_entity_linker.py

This works, but saving the pipeline to disk and loading it back doesn't, for the reasons described above.

But if you think that use-case is not relevant, we can just assume we always have access to the original nlp object with the right vocab...

svlandeg added the usage label Jul 31, 2019
ines added the bug label Aug 1, 2019
ines closed this as completed Aug 1, 2019