Norwegian Bokmål model produces incorrect lemmas for NOUNs #5658

Closed
malakhovks opened this issue Jun 28, 2020 · 4 comments
Labels
bug Bugs and behaviour differing from documentation feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / nb Norwegian (Bokmål) language data and models perf / accuracy Performance: accuracy

Comments

malakhovks commented Jun 28, 2020

The Norwegian Bokmål model 2.3.0 produces incorrect lemmas for NOUNs.

For example, in the sentence "Formuesskatten er en skatt som utlignes på grunnlag av nettoformuen din.", the lemma of Formuesskatten is returned as formuesskatten; the correct lemma in this case is formuesskatt.

With the previous release of the Norwegian Bokmål model, 2.2.5, the lemma of Formuesskatten is determined correctly.

This error affects the subsequent decomposition of compound NOUNs.
With the correct lemma:
NOUN formuesskatten --> lemma --> formuesskatt --> samset-leks +skatt

With the incorrect lemma:
NOUN formuesskatten --> lemma --> formuesskatten --> samset-leks +skatten
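To make the knock-on effect concrete, here is a minimal sketch of a compound splitter that depends on its input lemma. The `MODIFIERS` list and the `compound_head` helper are hypothetical and only stand in for the real decomposition step, which is not shown in this report.

```python
from typing import Optional

# Hypothetical first elements of compounds (including the linking -s-).
# Illustration only; a real decomposer would use a proper lexicon.
MODIFIERS = ("formues", "netto")


def compound_head(lemma: str, modifiers=MODIFIERS) -> Optional[str]:
    """Return what remains after the longest matching modifier, or None."""
    word = lemma.lower()
    for mod in sorted(modifiers, key=len, reverse=True):
        if word.startswith(mod) and len(word) > len(mod):
            return word[len(mod):]
    return None


print(compound_head("formuesskatt"))    # skatt
print(compound_head("formuesskatten"))  # skatten
```

The splitter simply passes through whatever ending the lemma carries, so a lemma that keeps the definite suffix yields a head that keeps it too.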

For now I am using the older model (v2.2.5) for this kind of task.

How to reproduce the behaviour

import spacy

nlp = spacy.load("nb_core_news_sm")
doc = nlp("Formuesskatten er en skatt som utlignes på grunnlag av nettoformuen din.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Result:

Formuesskatten formuesskatten NOUN NOUN__Definite=Def|Gender=Masc|Number=Sing nsubj Xxxxx True False
er er AUX AUX__Mood=Ind|Tense=Pres|VerbForm=Fin cop xx True True
en en DET DET__Gender=Masc|Number=Sing|PronType=Art det xx True True
skatt skatt NOUN NOUN__Definite=Ind|Gender=Masc|Number=Sing ROOT xxxx True False
som som PRON PRON__PronType=Rel nsubj:pass xxx True True
utlignes utlignes VERB VERB__Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass acl:relcl xxxx True False
på på ADP ADP case xx True True
grunnlag grunnlag NOUN NOUN__Definite=Ind|Gender=Neut|Number=Sing obl xxxx True False
av av ADP ADP case xx True True
nettoformuen nettoformuen NOUN NOUN__Definite=Def|Gender=Masc|Number=Sing nmod xxxx True False
din din PRON PRON__Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs nmod xxx True False
. . PUNCT PUNCT punct . False False
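A rough way to spot the affected tokens in output like the above is to flag definite NOUNs whose lemma is just the lowercased surface form. This is only a heuristic sketch; the helper name `unchanged_definite_nouns` is made up for illustration, and the rows are copied from the result above.

```python
def unchanged_definite_nouns(rows):
    """Return texts of definite NOUN tokens whose lemma equals the lowercased text."""
    flagged = []
    for text, lemma, pos, tag in rows:
        if pos == "NOUN" and "Definite=Def" in tag and lemma == text.lower():
            flagged.append(text)
    return flagged


# (text, lemma, pos, tag) for the NOUN rows in the result above.
rows = [
    ("Formuesskatten", "formuesskatten", "NOUN",
     "NOUN__Definite=Def|Gender=Masc|Number=Sing"),
    ("skatt", "skatt", "NOUN",
     "NOUN__Definite=Ind|Gender=Masc|Number=Sing"),
    ("grunnlag", "grunnlag", "NOUN",
     "NOUN__Definite=Ind|Gender=Neut|Number=Sing"),
    ("nettoformuen", "nettoformuen", "NOUN",
     "NOUN__Definite=Def|Gender=Masc|Number=Sing"),
]

print(unchanged_definite_nouns(rows))  # ['Formuesskatten', 'nettoformuen']
```

With a loaded pipeline, the same rows can be built from `(token.text, token.lemma_, token.pos_, token.tag_)` for each token in `doc`.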

Your Environment

  • Operating System: macOS 10.11.6
  • spaCy version: 2.3.0
  • Platform: macOS-10.11.6-x86_64-i386-64bit
  • Python version: 3.8.3

python -m spacy info

============================== Info about spaCy ==============================

spaCy version    2.3.0                         
Location         /Users/MalahovKS/Documents/Velychko/2020/nor-projects/nb-terms/venv/lib/python3.8/site-packages/spacy
Platform         macOS-10.11.6-x86_64-i386-64bit
Python version   3.8.3

python -m spacy validate

✔ Loaded compatibility table

====================== Installed models (spaCy v2.3.0) ======================
ℹ spaCy installation:
/Users/MalahovKS/Documents/Velychko/2020/nor-projects/nb-terms/venv/lib/python3.8/site-packages/spacy

TYPE      NAME              MODEL             VERSION                            
package   nb-core-news-sm   nb_core_news_sm   2.3.0   ✔
@svlandeg svlandeg added feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / nb Norwegian (Bokmål) language data and models perf / accuracy Performance: accuracy labels Jun 28, 2020
@adrianeboyd adrianeboyd added the bug Bugs and behaviour differing from documentation label Jun 29, 2020
adrianeboyd (Contributor) commented

Thanks for the report! I can replicate this and it looks like a bug in the lemmatizer.

The 2.3.0 models include more consistent tag maps with morphological features from the UD corpora, but it looks like the presence of the morphological features triggered some older English-specific code that skips lemmatization for singular nouns, which is clearly a bug here. We'll look into a fix!
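The failure mode described here can be illustrated with a toy lemmatizer. This is a sketch inferred from the description above, not spaCy's actual code: the single suffix rule, the function names, and the `use_shortcut` switch are all assumptions made for the example.

```python
# Toy model of the described failure mode; NOT spaCy's implementation.
# Hypothetical suffix rule: strip the definite singular ending -en.
RULES = [("en", "")]


def looks_like_base_form(pos, morph):
    # English-style shortcut: a singular noun is assumed to be its own lemma.
    return pos == "NOUN" and morph.get("Number") == "Sing"


def lemmatize(word, pos, morph, use_shortcut):
    word = word.lower()
    if use_shortcut and looks_like_base_form(pos, morph):
        return word  # suffix rules are never reached
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


features = {"Definite": "Def", "Gender": "Masc", "Number": "Sing"}
print(lemmatize("Formuesskatten", "NOUN", features, use_shortcut=True))   # formuesskatten
print(lemmatize("Formuesskatten", "NOUN", features, use_shortcut=False))  # formuesskatt
print(lemmatize("Formuesskatten", "NOUN", {}, use_shortcut=True))         # formuesskatt
```

The last call shows why the problem would only surface once tags carry morphological features: with no `Number` feature available, the shortcut never fires and the suffix rule applies.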

malakhovks (Author) commented

Great 👍

adrianeboyd (Contributor) commented

Okay, this should be fixed in the upcoming v2.3.1 by #5663.
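For anyone pinning versions in the meantime, a small guard along these lines may help. It only encodes what this thread reports (2.3.0 affected, 2.2.5 fine, fix expected in 2.3.1); the helper names are hypothetical, and other versions are simply treated as unaffected.

```python
def version_tuple(version: str):
    """Parse 'X.Y.Z' into a tuple of ints (pre-release suffixes are not handled)."""
    return tuple(int(part) for part in version.split(".")[:3])


def affected_by_this_issue(spacy_version: str) -> bool:
    # Assumption from this thread: only 2.3.0 is reported as affected.
    return version_tuple(spacy_version) == (2, 3, 0)


print(affected_by_this_issue("2.3.0"))  # True
print(affected_by_this_issue("2.2.5"))  # False
print(affected_by_this_issue("2.3.1"))  # False
```

In a real environment the argument would be `spacy.__version__`.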

github-actions bot commented Nov 4, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021