Norwegian Bokmål sentence segmentation not working #4401

Closed
emilmuller opened this issue Oct 8, 2019 · 6 comments
Labels
feat / parser Feature: Dependency Parser lang / nb Norwegian (Bokmål) language data and models models Issues related to the statistical models perf / accuracy Performance: accuracy

Comments

@emilmuller

Doing this:

import spacy

# Load the small Norwegian Bokmål model and print one sentence per line.
nlp = spacy.load("nb_core_news_sm")
doc = nlp("Hei på deg. Jeg har det fint.")
for sent in doc.sents:
    print(sent.text)

Outputs:

Hei på deg. Jeg har det fint.

:(

#3082 mentions that sbd may need to be added to the pipeline. I'm not sure whether it already is, how I can add it, or whether it should be there by default.

@ines
Member

ines commented Oct 8, 2019

By default, sentence segmentation is performed by the dependency parser, so it looks like the parser failed to split this example. You can always add the rule-based sentencizer to the pipeline, though, which uses a simpler strategy: https://spacy.io/usage/linguistic-features#sbd-component
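
A minimal sketch of adding the rule-based sentencizer. It assumes a blank Norwegian Bokmål pipeline (`spacy.blank("nb")`) rather than the `nb_core_news_sm` model from the report, so no parser is involved. The version check is also an assumption, added to cover both the spaCy v2 API current at the time of this thread and the later v3 API:

```python
import spacy

# Assumption: a blank "nb" pipeline (tokenizer only) stands in for nb_core_news_sm,
# so sentence boundaries come from the sentencizer alone.
nlp = spacy.blank("nb")

if spacy.__version__.startswith("2."):
    # spaCy v2: create the component first, then add it to the pipeline.
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
else:
    # spaCy v3 and later: add the component by its registered name.
    nlp.add_pipe("sentencizer")

doc = nlp("Hei på deg. Jeg har det fint.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

With a loaded model that also contains a parser, the sentencizer would typically be added ahead of the parser (for example with `before="parser"`), so that the boundaries it sets are the ones that end up in `doc.sents`. That placement is not shown here.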

@ines ines added feat / parser Feature: Dependency Parser lang / nb Norwegian (Bokmål) language data and models usage General spaCy usage labels Oct 8, 2019
@svlandeg
Member

svlandeg commented Oct 8, 2019

Also I'd like to point out that the name sbd is now deprecated in favour of the name sentencizer.

@ines : maybe to avoid confusion we should also rename the link in the docs?

@ines
Member

ines commented Oct 8, 2019

maybe to avoid confusion we should also rename the link in the docs?

The anchors are the only references I kept to not break backwards compatibility – and rewriting anchor links is a pain 😞

Also, on the topic of the nb model: I can confirm that the current Norwegian parser doesn't seem to segment sentences.

@adrianeboyd
Contributor

The training data only consists of individual sentences as documents, so the model seems to be determined to predict only one ROOT per text. I will work on it a bit to create some fake paragraphs so it can be retrained.
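
The idea of building fake paragraphs can be illustrated with a rough sketch. This is not the procedure actually used to retrain the model: the function name `make_fake_paragraphs`, its parameters, and the sample sentences are hypothetical, and real training data would carry dependency annotations rather than plain strings. It only shows how single-sentence documents could be grouped into multi-sentence ones, so that a model sees more than one ROOT per text:

```python
import random


def make_fake_paragraphs(sentences, min_len=1, max_len=4, seed=0):
    """Group consecutive single-sentence documents into pseudo-paragraphs.

    Hypothetical helper: each paragraph is a list of between min_len and
    max_len sentences (the last one may be shorter), in the original order.
    """
    rng = random.Random(seed)  # fixed seed keeps the grouping reproducible
    paragraphs = []
    i = 0
    while i < len(sentences):
        size = rng.randint(min_len, max_len)
        paragraphs.append(sentences[i:i + size])
        i += size
    return paragraphs


# Illustrative input only: plain strings standing in for annotated sentences.
sentences = [
    "Hei på deg.",
    "Jeg har det fint.",
    "Hei på deg.",
    "Jeg har det fint.",
    "Hei på deg.",
]
paragraphs = make_fake_paragraphs(sentences)
for paragraph in paragraphs:
    print(" ".join(paragraph))
```

Varying the paragraph length, rather than always joining a fixed number of sentences, is a design choice here so that the length of a text does not by itself predict the number of sentences.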

@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation and removed usage General spaCy usage labels Oct 8, 2019
@adrianeboyd adrianeboyd added models Issues related to the statistical models perf / accuracy Performance: accuracy and removed bug Bugs and behaviour differing from documentation labels Oct 9, 2019
@ines
Member

ines commented Oct 21, 2019

Just released a new version of the nb_core_news_sm model that should resolve the problem 😃

@ines ines closed this as completed Oct 21, 2019

@lock lock bot locked as resolved and limited conversation to collaborators Nov 20, 2019