
Wrong lemmatization using only tokenization (en language) #4102

Closed
ajrader opened this issue Aug 9, 2019 · 2 comments · Fixed by #4110
Labels
- feat / lemmatizer (Feature: Rule-based and lookup lemmatization)
- lang / en (English language data and models)
- perf / accuracy (Performance: accuracy)

Comments

@ajrader
Contributor

ajrader commented Aug 9, 2019

Observation

I noticed that when I used spaCy for tokenization only (disabling 'ner', 'parser', and 'tagger'), I got an erroneous lemma for the English word 'spun'. As long as 'tagger' is enabled, this past-tense verb is correctly mapped to 'spin'. But when 'tagger' is disabled, it gets mapped to 'spin-dry', which in my estimation is wrong.

I think the source of this error is line 35297 in https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lemmatizer/lookup.py:
"spun": "spin-dry"

How to reproduce the behavior

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('spun', disable=['tagger'])
for tok in doc:
    print(tok.text, tok.lemma_)
```

The output of this is:

```
spun spin-dry
```

I noticed the same behavior with 'en_core_web_lg' and 'en_core_web_md', presumably because this is the default lemmatizer mapping for all 'en' models.
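The mechanics can be illustrated with a small sketch (this is not spaCy's actual implementation): with the tagger disabled there is no POS information, so lookup lemmatization reduces to a single string-to-string table, and a bad entry surfaces directly in `tok.lemma_`. The `LOOKUP` dict and `lookup_lemma` helper below are hypothetical, containing just the entries discussed in this issue.

```python
# Minimal sketch of lookup-table lemmatization (not spaCy's actual code).
# Without POS tags, every token is mapped through one flat table.
LOOKUP = {
    "spun": "spin-dry",  # the erroneous entry from lookup.py
    "dry": "spin-dry",   # a related erroneous entry
    "ran": "run",        # a normal, correct entry for contrast
}

def lookup_lemma(token: str) -> str:
    # Fall back to the surface form when the table has no entry,
    # which is the usual default for lookup lemmatization.
    return LOOKUP.get(token, token)

print(lookup_lemma("spun"))  # -> spin-dry (the bug reported above)
print(lookup_lemma("cats"))  # -> cats (unknown tokens pass through)
```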

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.7.3
  • spaCy Version Used: 2.1.7
  • Environment Information: installed with conda
@svlandeg
Member

It looks like the look-up lemmatization indeed has two weird entries:

"spun": "spin-dry",
"dry": "spin-dry",

I guess it should be

"dry": "dry",
"spun": "spin",
"spun-dry": "spin-dry",

instead. You could make the edit to lookup.py and create a pull request if you like.

However, please also mind Ines' recent comment about lookup lemmatizations: they should be made redundant for languages where there is better lemmatization available.
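As a quick sanity check of the suggested entries (a sketch only; the actual fix is an edit to lookup.py), each surface form should map to its own lemma, with 'spun-dry' rather than 'spun' carrying the 'spin-dry' lemma:

```python
# Proposed corrected lookup entries from the comment above.
corrected = {
    "dry": "dry",
    "spun": "spin",
    "spun-dry": "spin-dry",
}

# The past tense of 'spin' no longer picks up the hyphenated lemma.
assert corrected["spun"] == "spin"
# The compound verb keeps its compound lemma.
assert corrected["spun-dry"] == "spin-dry"
print("corrected entries look consistent")
```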

@svlandeg svlandeg added feat / lemmatizer Feature: Rule-based and lookup lemmatization perf / accuracy Performance: accuracy labels Aug 10, 2019
@svlandeg svlandeg added the lang / en English language data and models label Aug 12, 2019
@lock

lock bot commented Sep 14, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 14, 2019