Assigning vectors to OOV words #5170

Closed
maxmealy opened this issue Mar 19, 2020 · 11 comments · Fixed by #5266
Labels
bug Bugs and behaviour differing from documentation feat / ner Feature: Named Entity Recognizer feat / vectors Feature: Word vectors and similarity usage General spaCy usage

Comments

@maxmealy

What is the correct way to modify the vectors for out of vocabulary words, so that the updated vectors are used by NER? I am trying to do it below and it is not working as I would expect. Thanks.

import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("He traveled to Paris last week")
print(doc.ents[0].label_)  # GPE

doc = nlp("He traveled to Lutetia last week")
print(doc.ents[0].label_)  # ORG

# Copy the vector for "Paris" onto the out-of-vocabulary word "Lutetia"
nlp.vocab.set_vector("Lutetia", nlp.vocab["Paris"].vector)
doc = nlp("He traveled to Lutetia last week")
print(doc.ents[0].label_)  # still ORG
@svlandeg svlandeg added feat / ner Feature: Named Entity Recognizer feat / vectors Feature: Word vectors and similarity usage General spaCy usage labels Mar 19, 2020
@adrianeboyd
Contributor

I think the model is using the vector, but the vector isn't the only feature used by the model so it's still predicting ORG. The model also uses the word form, prefix (1 character), suffix (3 characters), and shape (Xxxxx).
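
To make the point about non-vector features concrete, here is a minimal pure-Python sketch of the four lexical features listed above. The helper names (`prefix`, `suffix`, `shape`, `features`) are hypothetical, and the shape rule only approximates spaCy's behaviour (runs of the same character class are capped at four); it is not spaCy's own code.

```python
def prefix(word: str) -> str:
    """First character of the word."""
    return word[:1]


def suffix(word: str) -> str:
    """Last three characters of the word."""
    return word[-3:]


def shape(word: str) -> str:
    """Approximate word shape: X for upper, x for lower, d for digit.

    Runs of the same shape character are truncated after four repeats,
    which is an assumption modelled on the "Xxxxx" example in the thread.
    """
    out = []
    last = ""
    run = 0
    for ch in word:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:
            out.append(shape_char)
    return "".join(out)


def features(word: str) -> dict:
    """Collect the non-vector features for one word."""
    return {
        "form": word,
        "prefix": prefix(word),
        "suffix": suffix(word),
        "shape": shape(word),
    }


a = features("Palestine")
b = features("Pbmftuine")
# Names of the features on which the two words disagree
differing = [name for name in a if a[name] != b[name]]
print(differing)
```

Under these assumptions, "Palestine" and "Pbmftuine" differ only in the word form, while "Paris" and "Lutetia" would also differ in prefix and suffix.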

If you have a list of entities, you could use the EntityRuler in combination with the ner component. Alternatively, if you have example sentences that include your entities (with all other OntoNotes entity types labeled as well, or you'll run into the catastrophic forgetting problem), you could update the model as described here: https://spacy.io/usage/training#ner
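
The EntityRuler suggestion could look roughly like the sketch below. It assumes the spaCy v3 API (`nlp.add_pipe("entity_ruler")`), which is newer than the version discussed at this point in the thread, and it uses a blank English pipeline so that no pretrained model is needed. The pattern list is illustrative only.

```python
import spacy

# Blank pipeline: tokenizer only, no statistical ner component.
# With a pretrained pipeline you would typically add the ruler
# relative to the existing component, e.g. before="ner" (assumption: v3 API).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative pattern: always label the exact string "Lutetia" as GPE
ruler.add_patterns([{"label": "GPE", "pattern": "Lutetia"}])

doc = nlp("He traveled to Lutetia last week")
labels = [(ent.text, ent.label_) for ent in doc.ents]
print(labels)
```

A rule like this does not depend on the word's vector at all, so it sidesteps the question of whether the updated vector reaches the model.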

@maxmealy
Author

Are we sure NER uses the updated vector? Using a longer place name and shifting every letter outside the prefix and suffix forward by one in the alphabet, I get a similar result:

doc = nlp("He traveled to Palestine last week")
print(doc.ents[0].label_)  # GPE

doc = nlp("He traveled to Pbmftuine last week")
print(doc.ents[0].label_)  # ORG

# Copy the vector for "Palestine" onto the made-up word "Pbmftuine"
nlp.vocab.set_vector("Pbmftuine", nlp.vocab["Palestine"].vector)
doc = nlp("He traveled to Pbmftuine last week")
print(doc.ents[0].label_)  # still ORG

I see what you are saying about more training, but if the entity vectors are zero, I'm not sure further training will give the same results when there is insufficient context.

@evaldask

Running this on spacy 2.2.3 and en_core_web_md version 2.2.5 I get these results:

>>> nlp = spacy.load("en_core_web_md")
>>> doc = nlp("He traveled to Lutetia last week")
>>> doc.ents[0].label_
"GPE"
>>> doc = nlp("He traveled to Pbmftuine last week")
>>> doc.ents[0].label_
"GPE"
>>> nlp.meta["version"] # model version
"2.2.5"

Try updating your en_core_web_md model to this version.

@maxmealy
Author

@evalkaz It looks like 2.2.5 weights are a little different and make a GPE prediction using context, even with an OOV word with a 0 vector. However, the underlying issue here is unchanged: What is the correct way to change a word's vector, so that the new vector is used by the NER model? Here is an example with no context to isolate the issue a little more:

>>> nlp = spacy.load("en_core_web_md")
>>> doc = nlp("Palestine")
>>> len(doc.ents)
1
>>> nlp.vocab.set_vector("Pbmftuine", nlp.vocab["Palestine"].vector)
>>> doc = nlp("Pbmftuine")
>>> len(doc.ents)
0

@adrianeboyd
Contributor

@evalkaz Can you try saving the model to disk (nlp.to_disk("/path/to/model")) and then reloading it with spacy.load("/path/to/model") to see if this makes a difference?

I think I found a spot where vectors.set_vector() is missing a step that links the lexeme to the right vector, but reloading should fix this. Because the word itself is also a feature, Pbmftuine is never going to be identical to Palestine, but this would make all the other features identical.
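
A rough sketch of the save-and-reload round trip follows. It does not reproduce the bug itself: it uses a blank English pipeline, a stand-in 4-dimensional vector, and a temporary directory (all assumptions made so the example runs without en_core_web_md), and it was written against the spaCy v3 API. It only shows that a vector set with `nlp.vocab.set_vector` is still attached to the word after `nlp.to_disk` and `spacy.load`.

```python
import tempfile

import numpy as np
import spacy

nlp = spacy.blank("en")

# Stand-in vector; in the thread this would be nlp.vocab["Palestine"].vector
vector = np.arange(4, dtype="float32")
nlp.vocab.set_vector("Pbmftuine", vector)

# Save the pipeline to a temporary directory and load it back
with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)
    reloaded = spacy.load(model_dir)

# After reloading, the lexeme should be linked to the stored vector
restored = reloaded.vocab["Pbmftuine"].vector
round_trip_ok = np.array_equal(restored, vector)
print(round_trip_ok)
```

With a pretrained pipeline, the same two calls would replace the temporary directory with a real model path, as in the comment above.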

@evaldask

evaldask commented Apr 1, 2020

@adrianeboyd Yeah, I can reproduce this. After saving and reloading the model, Pbmftuine is recognized as an entity; without reloading, it is not.

@adrianeboyd adrianeboyd added the bug Bugs and behaviour differing from documentation label Apr 1, 2020
@adrianeboyd
Contributor

Thanks for confirming that! This does look like a bug then.

@maxmealy
Author

maxmealy commented Apr 7, 2020

@adrianeboyd @honnibal I rebuilt spaCy from the latest changes and don't believe #5266 actually resolves the issue here. To restate it, changing the vectors for a word via nlp.vocab doesn't seem to change the vector that is being fed into the NER model. Saving/Loading from disk seems to fix it.

>>> import numpy as np
>>> import spacy
>>> nlp = spacy.load('en_core_web_md')
>>> tok2vec = nlp.get_pipe("ner").model.tok2vec
>>> text = "Pbmftuine"
>>> veca = tok2vec([nlp(text)])
>>> nlp.vocab.set_vector("Pbmftuine", nlp.vocab["Palestine"].vector)
>>> nlp.vocab.set_vector("Pbmftuine".lower(), nlp.vocab["Palestine".lower()].vector)
>>> vecb = tok2vec([nlp(text)])
>>> np.array_equal(veca, vecb)
True

@adrianeboyd
Contributor

Hmm, I'm not sure what's going on then. I'll reopen this for now.

@adrianeboyd adrianeboyd reopened this Apr 8, 2020
@svlandeg
Member

This seems to work properly now with spaCy v3:

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("Palestine")
print(len(doc.ents))

doc = nlp("Pbmftuine")
print(len(doc.ents))

nlp.vocab.set_vector("Pbmftuine", nlp.vocab["Palestine"].vector)
doc = nlp("Pbmftuine")
print(len(doc.ents))

gives me:

1
0
1

So I'll tentatively close this :-)

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2021