Norwegian Bokmål model handles lemmatization process for NOUNs with incorrectly results #5658

malakhovks · 2020-06-28T07:58:36Z

The Norwegian Bokmål model 2.3.0 produces incorrect lemmas for NOUNs.

For example, in the sentence "Formuesskatten er en skatt som utlignes på grunnlag av nettoformuen din." the lemma of Formuesskatten is not determined correctly: the model returns formuesskatten, while the correct lemma in this case is formuesskatt.

The previous release of the Norwegian Bokmål model, 2.2.5, determines the lemma of Formuesskatten correctly.

This error affects the subsequent decomposition of compound NOUNs.
With the correct lemma:
NOUN formuesskatten --> lemma --> formuesskatt --> samset-leks +skatt

With the incorrect lemma:
NOUN formuesskatten --> lemma --> formuesskatten --> samset-leks +skatten
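
To make the knock-on effect concrete, here is a minimal sketch of how a wrong lemma propagates into compound splitting. It is only an illustration: KNOWN_MODIFIERS and split_head are hypothetical names, not part of spaCy or of my actual pipeline, and the real decomposition step is more involved.

```python
# Toy illustration only. KNOWN_MODIFIERS and split_head are hypothetical,
# not spaCy API and not the decomposition code used in the report.
KNOWN_MODIFIERS = ["formues", "netto"]  # first elements, incl. linking "s"


def split_head(lemma, modifiers=KNOWN_MODIFIERS):
    """Strip a known first element and return the remaining head, or None."""
    lemma = lemma.lower()
    for modifier in modifiers:
        if lemma.startswith(modifier) and len(lemma) > len(modifier):
            return lemma[len(modifier):]
    return None


print(split_head("formuesskatt"))    # skatt   (head from the correct lemma)
print(split_head("formuesskatten"))  # skatten (inflected head from the wrong lemma)
```

Whatever splitting strategy is used, the head inherits the definite suffix when the lemmatizer returns the inflected form, which is why the lemma has to be right first.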

For now I am using the older model (v2.2.5) for these tasks.

How to reproduce the behaviour

import spacy

nlp = spacy.load("nb_core_news_sm")
doc = nlp("Formuesskatten er en skatt som utlignes på grunnlag av nettoformuen din.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Result:

Formuesskatten formuesskatten NOUN NOUN__Definite=Def|Gender=Masc|Number=Sing nsubj Xxxxx True False
er er AUX AUX__Mood=Ind|Tense=Pres|VerbForm=Fin cop xx True True
en en DET DET__Gender=Masc|Number=Sing|PronType=Art det xx True True
skatt skatt NOUN NOUN__Definite=Ind|Gender=Masc|Number=Sing ROOT xxxx True False
som som PRON PRON__PronType=Rel nsubj:pass xxx True True
utlignes utlignes VERB VERB__Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass acl:relcl xxxx True False
på på ADP ADP case xx True True
grunnlag grunnlag NOUN NOUN__Definite=Ind|Gender=Neut|Number=Sing obl xxxx True False
av av ADP ADP case xx True True
nettoformuen nettoformuen NOUN NOUN__Definite=Def|Gender=Masc|Number=Sing nmod xxxx True False
din din PRON PRON__Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs nmod xxx True False
. . PUNCT PUNCT punct . False False

Your Environment

Operating System: macOS 10.11.6
spaCy version: 2.3.0
Platform: macOS-10.11.6-x86_64-i386-64bit
Python version: 3.8.3

python -m spacy info

============================== Info about spaCy ==============================

spaCy version    2.3.0                         
Location         /Users/MalahovKS/Documents/Velychko/2020/nor-projects/nb-terms/venv/lib/python3.8/site-packages/spacy
Platform         macOS-10.11.6-x86_64-i386-64bit
Python version   3.8.3

python -m spacy validate
✔ Loaded compatibility table

====================== Installed models (spaCy v2.3.0) ======================
ℹ spaCy installation:
/Users/MalahovKS/Documents/Velychko/2020/nor-projects/nb-terms/venv/lib/python3.8/site-packages/spacy

TYPE      NAME              MODEL             VERSION                            
package   nb-core-news-sm   nb_core_news_sm   2.3.0   ✔

adrianeboyd · 2020-06-29T07:27:34Z

Thanks for the report! I can replicate this and it looks like a bug in the lemmatizer.

The 2.3.0 models include more consistent tag maps with morphological features from the UD corpora, but it looks like the presence of the morphological features triggered some older English-specific code that skips lemmatization for singular nouns, which is clearly a bug here. We'll look into a fix!
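
A simplified sketch of the kind of shortcut described above, assuming a toy rule-based lemmatizer. This is not spaCy's actual source: TOY_RULES, apply_rules, lemmatize and the english_shortcut flag are hypothetical stand-ins that only mimic the behaviour reported in this issue.

```python
# Simplified sketch, NOT spaCy's real implementation. All names here are
# hypothetical and exist only to illustrate the reported behaviour.
TOY_RULES = [("en", ""), ("et", "")]  # toy suffix rules for definite singular


def apply_rules(word):
    """Apply the first matching toy suffix rule."""
    for old, new in TOY_RULES:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


def lemmatize(word, pos, morphology=None, english_shortcut=True):
    """Lemmatize a word, optionally with an English-style base-form shortcut."""
    morphology = morphology or {}
    word = word.lower()
    # English-style assumption: a singular noun is already its base form,
    # so the rules are skipped. This only fires once Number=Sing is present.
    if english_shortcut and pos == "NOUN" and morphology.get("Number") == "Sing":
        return word
    return apply_rules(word)


feats = {"Definite": "Def", "Gender": "Masc", "Number": "Sing"}
print(lemmatize("Formuesskatten", "NOUN"))         # formuesskatt (no features, rules run)
print(lemmatize("Formuesskatten", "NOUN", feats))  # formuesskatten (shortcut fires)
print(lemmatize("Formuesskatten", "NOUN", feats, english_shortcut=False))  # formuesskatt
```

The first call corresponds to a tag map without morphological features, where the shortcut never triggers; the second shows how adding Number=Sing is enough to skip the rules for a definite singular noun.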

malakhovks · 2020-06-29T07:30:51Z

Great 👍

adrianeboyd · 2020-06-30T09:36:37Z

Okay, this should be fixed in the upcoming v2.3.1 by #5663.

svlandeg added the feat / lemmatizer (Feature: Rule-based and lookup lemmatization), lang / nb (Norwegian (Bokmål) language data and models), and perf / accuracy (Performance: accuracy) labels Jun 28, 2020

adrianeboyd added the bug Bugs and behaviour differing from documentation label Jun 29, 2020

adrianeboyd closed this as completed Jun 30, 2020

github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021
