Incorrect POS tags when multiple models are loaded #3853

Closed
adrianeboyd opened this issue Jun 16, 2019 · 7 comments
Labels
bug Bugs and behaviour differing from documentation feat / vectors Feature: Word vectors and similarity models Issues related to the statistical models

Comments

@adrianeboyd
Contributor

Something strange is happening when en_core_web_md and en_core_web_lg are loaded at the same time, which leads to many POS tagging errors in the model that was loaded first.

The weird tagging results mentioned in this comment turn out to be an issue when multiple models are loaded at the same time rather than a problem specific to en_core_web_md.

To reproduce:

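import spacy

text = "Pompey took command of two legions in Capua and began to raise levies illegally."

# Tag the text with en_core_web_md before any other model is loaded.
nlp_md = spacy.load('en_core_web_md')
doc_before = nlp_md(text)

# Load en_core_web_lg, then tag the same text again with the md pipeline.
nlp_lg = spacy.load('en_core_web_lg')
doc_after = nlp_md(text)

# Reload en_core_web_md and tag the text a third time.
nlp_md = spacy.load('en_core_web_md')
doc_reloaded = nlp_md(text)

# Print each token with its tag before loading lg, after loading lg, and after reloading md.
for token_before, token_after, token_reloaded in zip(doc_before, doc_after, doc_reloaded):
    print("\t".join([token_before.text, token_before.tag_, token_after.tag_, token_reloaded.tag_]))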

Output:

Pompey	NNP	JJ	NNP
took	VBD	NN	VBD
command	NN	NN	NN
of	IN	IN	IN
two	CD	PRP$	CD
legions	NNS	NNS	NNS
in	IN	IN	IN
Capua	NNP	NNP	NNP
and	CC	CC	CC
began	VBD	NN	VBD
to	TO	IN	TO
raise	VB	JJ	VB
levies	NNS	NNS	NNS
illegally	RB	RB	RB
.	.	.	.
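
To quantify the difference, the rows above can be compared programmatically. This is only a small sketch that parses the printed output; `ROWS` is a transcription of the table and `count_changed` is a hypothetical helper, not part of spaCy.

```python
# Rows transcribed from the output above: text, tag before, tag after, tag reloaded.
ROWS = """\
Pompey NNP JJ NNP
took VBD NN VBD
command NN NN NN
of IN IN IN
two CD PRP$ CD
legions NNS NNS NNS
in IN IN IN
Capua NNP NNP NNP
and CC CC CC
began VBD NN VBD
to TO IN TO
raise VB JJ VB
levies NNS NNS NNS
illegally RB RB RB
. . . .
"""


def count_changed(rows, col_a, col_b):
    """Return the tokens whose tags differ between two columns."""
    changed = []
    for line in rows.splitlines():
        fields = line.split()
        if fields[col_a] != fields[col_b]:
            changed.append(fields[0])
    return changed


changed_after = count_changed(ROWS, 1, 2)     # before vs. after loading lg
changed_reloaded = count_changed(ROWS, 1, 3)  # before vs. after reloading md
print(len(changed_after), changed_after)
print(len(changed_reloaded), changed_reloaded)
```

Six of the 15 tokens change tag once en_core_web_lg is loaded, and none differ after en_core_web_md is reloaded.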

Loading en_core_web_sm doesn't seem to cause similar problems, but loading en_core_web_md/en_core_web_lg in either order leads to many incorrect tags (plus obviously cascading errors in the rest of the pipeline) in the model that was loaded first.

Your Environment

  • spaCy version: 2.1.4
  • Platform: Linux-4.15.0-51-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.6.8

spaCy 2.0 doesn't seem to have this issue.

@ines ines added bug Bugs and behaviour differing from documentation feat / vectors Feature: Word vectors and similarity models Issues related to the statistical models labels Jun 16, 2019
@ines
Member

ines commented Jun 16, 2019

Thanks for the report! I think I know what might be happening here:

The md and lg models were both trained with vectors and also need those vectors at runtime to make predictions. To make it easier for Thinc (spaCy's machine learning library) to resolve IDs back to vectors, the vectors are referenced in a global lookup table, under a given name. It seems like those names clash for the md and lg models. So when you process the text again after loading the lg model, Thinc is using the wrong vectors, resulting in worse predictions.
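
The clash described above can be illustrated with a minimal sketch. It does not reproduce Thinc's actual code: `VECTORS`, `register_vectors`, `lookup`, the shared name `"shared.vectors"`, and the toy tables are all hypothetical stand-ins, assuming only that vectors live in one process-wide table keyed by name.

```python
# Hypothetical stand-in for a process-wide vectors table keyed by name.
VECTORS = {}


def register_vectors(name, table):
    """Store a vectors table under `name`, silently replacing any earlier entry."""
    VECTORS[name] = table


def lookup(name, row):
    """Resolve a row ID to a vector through the global table."""
    return VECTORS[name][row]


# Toy tables standing in for the md and lg vectors (all values are made up).
md_table = [[0.1, 0.2], [0.3, 0.4]]
lg_table = [[9.0, 9.1], [9.2, 9.3], [9.4, 9.5]]

register_vectors("shared.vectors", md_table)
before = lookup("shared.vectors", 1)  # the first model sees its own vector

register_vectors("shared.vectors", lg_table)  # the second model reuses the name
after = lookup("shared.vectors", 1)  # the first model now reads the other table

print(before, after)
```

In this toy setup the first model's weights are unchanged, but the same row ID now resolves to a different vector, which is the kind of mismatch that would degrade its predictions.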

@adrianeboyd
Contributor Author

Thanks for the reply! That makes sense. I realize this is not the most common use case, but it's still a bit unexpected, so if it's not something that can be fixed easily, maybe a warning when you load a conflicting model could be helpful?

@honnibal
Member

We'll at least add a warning in the next version, but I definitely do think this is a bug we should fix. Thanks again for the report.

honnibal added a commit that referenced this issue Jul 11, 2019
@honnibal
Member

honnibal commented Jul 11, 2019

When we repackage the models, we need to take care that the vectors.name attribute is more specific. The question is whether we should add a hack to fix this, after we detect the problem.

We could change the value of nlp.vocab.vectors.name and also update the entry in nlp.meta["vectors"]["name"], changing it to something like nlp.vocab.vectors.name + "_%d" % nlp.vocab.vectors.shape[0] instead of printing the warning. I think this should fix the issue? I'm not sure whether it'll lead to further problems, though. Still, it might be worth the hack. The current behaviour is pretty bad, after all, even with the warning.
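
A rough sketch of the renaming idea, for illustration only. `make_vectors_name_specific` is a hypothetical helper, not spaCy API, and the `SimpleNamespace` objects merely mimic the `nlp.vocab.vectors.name`, `nlp.vocab.vectors.shape`, and `nlp.meta["vectors"]["name"]` attributes mentioned above so that the snippet runs without any model installed. The vectors name and shape are made up.

```python
from types import SimpleNamespace


def make_vectors_name_specific(nlp):
    """Append the number of vector rows to the vectors name (hypothetical helper)."""
    vectors = nlp.vocab.vectors
    new_name = vectors.name + "_%d" % vectors.shape[0]
    vectors.name = new_name
    # Keep the meta entry in sync with the renamed table.
    nlp.meta["vectors"]["name"] = new_name
    return new_name


# Stand-in object with made-up values; a real pipeline would come from spacy.load().
fake_nlp = SimpleNamespace(
    vocab=SimpleNamespace(
        vectors=SimpleNamespace(name="example.vectors", shape=(20000, 300))
    ),
    meta={"vectors": {"name": "example.vectors"}},
)

print(make_vectors_name_specific(fake_nlp))
```

Because the suffix is the row count, tables of different sizes end up under different names. Two tables with the same row count would still share a name under this scheme.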

honnibal added a commit that referenced this issue Jul 11, 2019
@honnibal
Member

Warn-and-continue was kind of a dumb behaviour, since the results for the model loaded first would predictably be bad. We may as well try changing the name. I added a warning pointing people here as well, so that it's easier to find the context if the problem is encountered.

We should fix this properly in the v2.2 line of models, by making the vector names more specific.

@erotavlas

@honnibal does this have any effect on custom models loaded one after another?
(For example when doing k-fold cross validation I reload each model to compute the precision, recall and fscore metrics. )
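
The thread does not answer this question. One way to check a set of custom models yourself might look like the sketch below. It assumes that a clash matters when the same vectors name refers to tables of different shapes, which is an inference from the renaming idea discussed above rather than a confirmed rule. `find_vectors_name_clashes` and `fake_pipeline` are hypothetical helpers, and the stand-in objects expose only `vocab.vectors.name` and `vocab.vectors.shape`.

```python
from collections import defaultdict
from types import SimpleNamespace


def find_vectors_name_clashes(pipelines):
    """Return vectors names that are shared by tables of different shapes."""
    shapes_by_name = defaultdict(set)
    for nlp in pipelines:
        vectors = nlp.vocab.vectors
        shapes_by_name[vectors.name].add(tuple(vectors.shape))
    return {
        name: sorted(shapes)
        for name, shapes in shapes_by_name.items()
        if len(shapes) > 1
    }


def fake_pipeline(name, shape):
    """Build a stand-in exposing only vocab.vectors.name and vocab.vectors.shape."""
    return SimpleNamespace(
        vocab=SimpleNamespace(vectors=SimpleNamespace(name=name, shape=shape))
    )


# Made-up example: three fold models, one of which has a larger vectors table.
folds = [
    fake_pipeline("custom.vectors", (1000, 50)),
    fake_pipeline("custom.vectors", (1000, 50)),
    fake_pipeline("custom.vectors", (2000, 50)),
]
print(find_vectors_name_clashes(folds))
```

With real models, the list passed in would hold the objects returned by `spacy.load` for each fold.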

@lock

lock bot commented Aug 15, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 15, 2019
polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019