Norwegian Bokmål sentence segmentation not working #4401

emilmuller · 2019-10-08T12:04:02Z

Doing this:

import spacy
nlp = spacy.load("nb_core_news_sm")
doc = nlp("Hei på deg. Jeg har det fint.")
for sent in doc.sents:
    print(sent.text)

Outputs:

Hei på deg. Jeg har det fint.

:(

#3082 mentions that sbd may need to be added to the pipeline. I'm not sure whether it is already there, how I can add it, or whether it should be there by default.


ines · 2019-10-08T12:15:57Z

The default sentence segmentation happens via the dependency parser – so this seems to have not worked in this example. You can always add the rule-based sentencizer to the pipeline, though, which uses a simpler strategy: https://spacy.io/usage/linguistic-features#sbd-component
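As a rough sketch of the sentencizer route: the snippet below assumes spaCy is installed and uses a blank nb pipeline, so that no model download is needed (that choice is only for illustration and is not from this thread). The add_pipe call differs between spaCy v2.x, which was current when this issue was opened, and v3.x, so the snippet branches on the major version.

```python
import spacy

# Blank Norwegian Bokmål pipeline: tokenizer only, no parser.
nlp = spacy.blank("nb")

major_version = int(spacy.__version__.split(".")[0])
if major_version >= 3:
    # spaCy v3.x: built-in components are added by their string name.
    nlp.add_pipe("sentencizer")
else:
    # spaCy v2.x: create the component object first, then add it.
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

doc = nlp("Hei på deg. Jeg har det fint.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

With a loaded model such as nb_core_news_sm, the same component can be added with before="parser", which should let the rule-based boundaries be set before the parser runs.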

svlandeg · 2019-10-08T12:19:10Z

Also I'd like to point out that the name sbd is now deprecated in favour of the name sentencizer.

@ines : maybe to avoid confusion we should also rename the link in the docs?

ines · 2019-10-08T12:33:45Z

> maybe to avoid confusion we should also rename the link in the docs?

The anchors are the only references I kept to not break backwards compatibility – and rewriting anchor links is a pain 😞

Also, on the topic of the nb model: I can confirm that the current Norwegian parser doesn't seem to segment sentences.

adrianeboyd · 2019-10-08T18:26:04Z

The training data only consists of individual sentences as documents, so the model seems to be determined to predict only one ROOT per text. I will work on it a bit to create some fake paragraphs so it can be retrained.
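To illustrate the idea only: the sketch below is not the script used for retraining. It groups single-sentence documents into multi-sentence "paragraphs" of random length, so that a parser trained on them would see more than one ROOT per text. The helper name make_fake_paragraphs, its parameters, and the sample sentences are all hypothetical.

```python
import random


def make_fake_paragraphs(sentences, min_len=1, max_len=4, seed=0):
    """Group consecutive sentences into chunks of random size.

    Hypothetical helper: each chunk stands in for one training document
    that contains several sentences instead of exactly one.
    """
    rng = random.Random(seed)  # seeded so the grouping is reproducible
    paragraphs = []
    i = 0
    while i < len(sentences):
        size = rng.randint(min_len, max_len)
        # The last chunk may be shorter if the sentences run out.
        paragraphs.append(sentences[i:i + size])
        i += size
    return paragraphs


# Made-up single-sentence "documents", as in the original training data.
sentences = [
    "Hei på deg.",
    "Jeg har det fint.",
    "Takk for sist.",
    "Vi ses i morgen.",
    "Ha det bra.",
]

paragraphs = make_fake_paragraphs(sentences)
for paragraph in paragraphs:
    print(" ".join(paragraph))
```

In real training data the per-sentence annotations would need to be merged along with the text; this sketch only shows the grouping step.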

ines · 2019-10-21T16:24:09Z

Just released a new version of the nb_core_news_sm model that should resolve the problem 😃

lock · 2019-11-20T16:54:50Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the labels feat / parser (Feature: Dependency Parser), lang / nb (Norwegian (Bokmål) language data and models), and usage (General spaCy usage) on Oct 8, 2019

adrianeboyd added the label bug (Bugs and behaviour differing from documentation) and removed the label usage (General spaCy usage) on Oct 8, 2019

adrianeboyd mentioned this issue on Oct 9, 2019 in: Sentence splitter not working for Greek #4408 (closed)

adrianeboyd added the labels models (Issues related to the statistical models) and perf / accuracy (Performance: accuracy) and removed the label bug (Bugs and behaviour differing from documentation) on Oct 9, 2019

ines closed this as completed Oct 21, 2019

lock bot locked as resolved and limited conversation to collaborators Nov 20, 2019
