Doc.is_sentenced is False for single-token Docs #3934

NixBiks · 2019-07-09T17:07:26Z

The following code should pass imo.

from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))

assert len([s for s in nlp('The sentencizer is working fine. Right').sents]) == 2
assert len([s for s in nlp('a').sents]) == 1

But the last line raises the following error:

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

Info about spaCy

spaCy version: 2.1.4
Platform: Darwin-18.6.0-x86_64-i386-64bit
Python version: 3.7.3

ines · 2019-07-09T18:19:11Z

This is currently expected, because the last Doc consists of only one token. I explained this in more detail in #3907:

I think the underlying issue here is that you end up with a single-word Doc. With the current implementation, the Token.is_sent_start property defaults to True if it's the very first token in a Doc. But if there's only one token, spaCy is unable to tell if sentence boundaries have been set or not.

We do want to change this in the future, but for now you have to accept that if your Doc contains only one token, sentence boundaries will appear as unset. Internally, the fix would be for Doc.is_sentenced not to rely only on whether the tokens after the first one have is_sent_start set to True or False. Instead, there should also be a separate flag that the sentencizer and parser can set to indicate that they ran on the Doc.
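Until that changes, one possible user-side workaround is to special-case very short Docs before touching doc.sents. The sketch below is only an illustration and not part of spaCy: safe_sents is a hypothetical helper name, and it assumes nothing more than an object that supports len(), slicing, and a .sents attribute, as spaCy's Doc does.

```python
def safe_sents(doc):
    """Return the sentences of `doc` as a list.

    Hypothetical helper, not a spaCy API. Docs with fewer than two tokens
    are handled without reading doc.sents, because that access raises E030
    for a single-token Doc in the spaCy version discussed in this thread.
    """
    if len(doc) == 0:
        return []          # nothing to segment
    if len(doc) == 1:
        return [doc[:]]    # the whole Doc is the only possible sentence
    return list(doc.sents)
```

With a helper like this, len(safe_sents(nlp('a'))) should be 1, which is what the second assertion in the report expects.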

honnibal · 2019-07-10T17:19:56Z

There's kind of no good policy around this unfortunately, but I think returning True is probably the least bad option. Our choices are:

1. Return True
2. Return False
3. Return something else
4. Raise
5. Add an extra field on the Doc

Options 3 and 4 are obviously terrible. Option 5 sounds like a good solution, but the problem is that we'd lose the information when we call doc.from_array(), which is pretty undesirable. So it's really between 1 and 2. I think if we have to choose, returning True makes more sense than returning False.
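To make the difference between the reported behaviour and option 1 concrete, here is a toy model in plain Python. It is not spaCy source code: each list entry stands for one token's is_sent_start value, None means "never set", and the use of any() over the tokens after the first is an assumption based on the description earlier in the thread. The function names are made up for illustration.

```python
from typing import List, Optional

# Toy model, not spaCy source. The first token is ignored because it
# reports True by default, so it cannot prove that boundaries were set.

def is_sentenced_old(sent_starts: List[Optional[bool]]) -> bool:
    """Reported behaviour: only tokens after the first count as evidence."""
    return any(value is not None for value in sent_starts[1:])

def is_sentenced_option_1(sent_starts: List[Optional[bool]]) -> bool:
    """Option 1: with fewer than two tokens there is no ambiguity."""
    if len(sent_starts) < 2:
        return True
    return is_sentenced_old(sent_starts)

print(is_sentenced_old([True]))                    # False
print(is_sentenced_option_1([True]))               # True
print(is_sentenced_option_1([True, None, None]))   # False
```

The last call shows that option 1 only changes the answer for Docs with fewer than two tokens; longer Docs whose boundaries were never set are still reported as unsentenced.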

lock · 2019-08-09T17:42:27Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

* Make doc.is_sentenced return True if len(doc) < 2.
* Make doc.is_nered return True if len(doc) == 0, for consistency.

Closes explosion#3934

ines added the labels enhancement (Feature requests and improvements) and feat / doc (Feature: Doc, Span and Token objects) Jul 9, 2019

ines mentioned this issue Jul 9, 2019

Span.sent is None for single-token Docs #3907

Closed

ines changed the title from "Sentencizer failing on URL" to "Doc.is_sentenced is False for single-token Docs" Jul 9, 2019

honnibal closed this as completed in 3d18600 Jul 10, 2019

lock bot locked as resolved and limited conversation to collaborators Aug 9, 2019

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019

Return True from doc.is_... when no ambiguity

52ec915

