Wrong lemmatization using only tokenization (en language) #4102
Comments
It looks like the look-up lemmatization indeed has two weird entries:
I guess it should be
instead. You could make the edit to lookup.py and create a pull request if you like. However, please also mind Ines' recent comment about lookup lemmatizations: they should be made redundant for languages where better lemmatization is available.
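For illustration only, the correction implied by the issue body below would presumably be a one-line change to the "spun" entry in lookup.py, along these lines (a hedged guess, not a quote of the actual comment):

"spun": "spin",  # hypothetical corrected entry; the file currently maps "spun" to "spin-dry"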
Observation
I noticed that if I try to use spaCy for tokenization only (disabling 'ner', 'parser', and 'tagger'), I get an erroneous mapping for the English word 'spun'. As long as the tagger is not disabled, this past-tense verb is correctly mapped to 'spin'. But when 'tagger' is disabled, it gets mapped to 'spin-dry', which in my estimation is wrong.
I think the source of this error is line 35297 of https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lemmatizer/lookup.py:
"spun": "spin-dry"
How to reproduce the behavior
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('spun', disable=['tagger'])
for tok in doc:
    print(tok.text, tok.lemma_)
The output of this is
spun spin-dry
I noticed the same behavior when using 'en_core_web_lg' and 'en_core_web_md', because I think this lookup table is the default lemmatizer mapping in all 'en' models.
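To make the contrast explicit, here is a small sketch that runs the same token through both pipeline configurations; the exact outputs assume a spaCy 2.x en_core_web_sm model with the faulty lookup entry still in place:

import spacy

nlp = spacy.load('en_core_web_sm')

# With the tagger running, the rule-based lemmatizer uses the POS tag.
doc_with_tagger = nlp('spun')
print(doc_with_tagger[0].lemma_)      # expected: spin

# With the tagger disabled, spaCy falls back to the lookup table.
doc_without_tagger = nlp('spun', disable=['tagger'])
print(doc_without_tagger[0].lemma_)   # observed: spin-dry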
Your Environment