Doc.is_sentenced is False for single-token Docs #3934

Closed
NixBiks opened this issue Jul 9, 2019 · 3 comments
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects

Comments

@NixBiks
Contributor

NixBiks commented Jul 9, 2019

The following code should pass imo.

from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

assert len([s for s in nlp('The sentencizer is working fine. Right').sents]) == 2
assert len([s for s in nlp('a').sents]) == 1

but the last line raises the following error:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

Info about spaCy

  • spaCy version: 2.1.4
  • Platform: Darwin-18.6.0-x86_64-i386-64bit
  • Python version: 3.7.3
@ines
Member

ines commented Jul 9, 2019

This is currently expected, because the last Doc consists of only one token. I've explained this in more detail in #3907:

I think the underlying issue here is that you end up with a single-word Doc. With the current implementation, the Token.is_sent_start property defaults to True if it's the very first token in a Doc. But if there's only one token, spaCy is unable to tell if sentence boundaries have been set or not.

We do want to change this in the future, but for now, you kind of have to accept that if your Doc only contains one token, sentence boundaries will appear as unset. The internal solution would be for Doc.is_sentenced to stop relying only on whether the tokens in the Doc, except the first one, have is_sent_start set to True or False. In addition, there should be a separate flag that the sentencizer and parser can set to indicate that they ran on the Doc.
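To make the ambiguity concrete, here is a minimal pure-Python sketch of the kind of check described above. It is not spaCy's source: the function name `is_sentenced_sketch` and the use of `None` for "unset" are assumptions made for illustration, and the sketch assumes the check looks for at least one explicitly set boundary flag after the first token.

```python
from typing import List, Optional


def is_sentenced_sketch(sent_starts: List[Optional[bool]]) -> bool:
    """Toy model of the check (hypothetical, not spaCy's implementation).

    Each entry is True/False if a boundary flag was set, or None if unset.
    The first token is skipped because it counts as a sentence start by
    default, so it carries no information about whether a component ran.
    """
    for flag in sent_starts[1:]:
        if flag is not None:
            return True
    return False


# Several tokens with flags set after the first: boundaries look "set".
print(is_sentenced_sketch([True, False, False, True]))  # True

# One token: nothing after the first token to inspect, so the result is
# False even if a sentencizer did run on this input.
print(is_sentenced_sketch([True]))  # False
```

With a single token the loop body never executes, which is why a one-token Doc looks the same as a Doc nobody has processed.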

ines added the "enhancement" (Feature requests and improvements) and "feat / doc" (Feature: Doc, Span and Token objects) labels on Jul 9, 2019
ines changed the title from "Sentencizer failing on URL" to "Doc.is_sentenced is False for single-token Docs" on Jul 9, 2019
@honnibal
Member

There's kind of no good policy around this unfortunately, but I think returning True is probably the least bad option. Our choices are:

  1. Return True
  2. Return False
  3. Return something else
  4. Raise
  5. Add an extra field on the Doc

Options 3 and 4 are obviously terrible. Option 5 sounds like a good solution, but the problem is that we'd lose the information when we do doc.from_array(), which is pretty undesirable. So it's really between 1 and 2. I think if we have to choose, returning True makes more sense than returning False.
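The concern about option 5 can be illustrated with a toy model. Everything below is hypothetical: `ToyDoc`, its `boundaries_set` flag, and the simplified `to_array`/`from_array` signatures are invented for this sketch and do not match spaCy's real Doc API. The assumption is only that an array export carries per-token values and nothing else.

```python
from typing import List, Optional


class ToyDoc:
    """Hypothetical stand-in for a Doc; not spaCy's real class."""

    def __init__(self, sent_starts: List[Optional[bool]],
                 boundaries_set: bool = False) -> None:
        self.sent_starts = sent_starts
        # Option 5: an extra Doc-level field set by the sentencizer/parser.
        self.boundaries_set = boundaries_set

    def to_array(self) -> List[Optional[bool]]:
        # Only per-token values are exported; the Doc-level flag is not.
        return list(self.sent_starts)

    @classmethod
    def from_array(cls, array: List[Optional[bool]]) -> "ToyDoc":
        # The flag cannot be recovered from per-token values alone.
        return cls(list(array))


original = ToyDoc([True], boundaries_set=True)
restored = ToyDoc.from_array(original.to_array())
print(original.boundaries_set, restored.boundaries_set)  # True False
```

In this model the round trip silently resets the flag, which is the kind of information loss that makes option 5 less attractive than it first appears.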

@lock

lock bot commented Aug 9, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked the thread as resolved and limited conversation to collaborators on Aug 9, 2019
polm pushed a commit to polm/spaCy that referenced this issue on Aug 18, 2019:
* Make doc.is_sentenced return True if len(doc) < 2.

* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes explosion#3934
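
As a rough illustration of the first bullet in that commit message, the following pure-Python sketch models a check that treats Docs with fewer than two tokens as sentenced. The function name and the `None`-means-unset convention are assumptions for this example; the actual change lives in spaCy's Doc implementation and may differ in detail.

```python
from typing import List, Optional


def is_sentenced_patched(sent_starts: List[Optional[bool]]) -> bool:
    """Toy model of the patched behavior (hypothetical, not spaCy's code)."""
    # With zero or one token there is nothing to split, so the Doc is
    # treated as already sentenced.
    if len(sent_starts) < 2:
        return True
    # Otherwise, look for an explicitly set flag after the first token.
    return any(flag is not None for flag in sent_starts[1:])


print(is_sentenced_patched([True]))               # True
print(is_sentenced_patched([]))                   # True
print(is_sentenced_patched([True, None, None]))   # False
print(is_sentenced_patched([True, False, True]))  # True
```

Under this rule the single-token case from the original report no longer trips the "sentence boundaries unset" check, while multi-token Docs with no flags set are still reported as unsentenced.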