
Wrong lemmatization using only tokenization (en language) #4102

Closed
ajrader opened this issue Aug 9, 2019 · 2 comments · Fixed by #4110
Labels
- feat / lemmatizer (Feature: Rule-based and lookup lemmatization)
- lang / en (English language data and models)
- perf / accuracy (Performance: accuracy)

Comments

@ajrader
Contributor

ajrader commented Aug 9, 2019

Observation

I noticed that when I used spaCy for tokenization only (disabling 'ner', 'parser', and 'tagger'), I got an erroneous lemma for the English word 'spun'. As long as 'tagger' is enabled, this past-tense verb is correctly mapped to 'spin'. But when 'tagger' is disabled, it gets mapped to 'spin-dry', which in my estimation is wrong.

I think the source of this error is line 35297 in https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lemmatizer/lookup.py:
"spun": "spin-dry"

How to reproduce the behavior

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('spun', disable=['tagger'])
for tok in doc:
    print(tok.text, tok.lemma_)
```

The output of this is:

```
spun spin-dry
```

I noticed the same behavior with 'en_core_web_lg' and 'en_core_web_md', presumably because this is the default lemmatizer mapping for all 'en' models.
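The mechanics can be illustrated with a small sketch (this is not spaCy's actual implementation): with the tagger disabled there is no POS information, so lookup lemmatization reduces to a single string-to-string table, and a bad entry surfaces directly in `tok.lemma_`. The `LOOKUP` dict and `lookup_lemma` helper below are hypothetical, containing just the entries discussed in this issue.

```python
# Minimal sketch of lookup-table lemmatization (not spaCy's actual code).
# Without POS tags, every token is mapped through one flat table.
LOOKUP = {
    "spun": "spin-dry",  # the erroneous entry from lookup.py
    "dry": "spin-dry",   # a related erroneous entry
    "ran": "run",        # a normal, correct entry for contrast
}

def lookup_lemma(token: str) -> str:
    # Fall back to the surface form when the table has no entry,
    # which is the usual default for lookup lemmatization.
    return LOOKUP.get(token, token)

print(lookup_lemma("spun"))  # -> spin-dry (the bug reported above)
print(lookup_lemma("cats"))  # -> cats (unknown tokens pass through)
```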

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.7.3
  • spaCy Version Used: 2.1.7
  • Environment Information: installed with conda
@svlandeg
Member

It looks like the look-up lemmatization indeed has two weird entries:

"spun": "spin-dry",
"dry": "spin-dry",

I guess it should be

"dry": "dry",
"spun": "spin",
"spun-dry": "spin-dry",

instead. You could make the edit to lookup.py and create a pull request if you like.

However, please also mind Ines' recent comment about lookup lemmatizations: they should be made redundant for languages where there is better lemmatization available.
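As a quick sanity check of the suggested entries (a sketch only; the actual fix is an edit to lookup.py), each surface form should map to its own lemma, with 'spun-dry' rather than 'spun' carrying the 'spin-dry' lemma:

```python
# Proposed corrected lookup entries from the comment above.
corrected = {
    "dry": "dry",
    "spun": "spin",
    "spun-dry": "spin-dry",
}

# The past tense of 'spin' no longer picks up the hyphenated lemma.
assert corrected["spun"] == "spin"
# The compound verb keeps its compound lemma.
assert corrected["spun-dry"] == "spin-dry"
print("corrected entries look consistent")
```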

@svlandeg svlandeg added feat / lemmatizer Feature: Rule-based and lookup lemmatization perf / accuracy Performance: accuracy labels Aug 10, 2019
@svlandeg svlandeg added the lang / en English language data and models label Aug 12, 2019
@lock

lock bot commented Sep 14, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 14, 2019