Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parser not obeying is_sent_start == False (regression) #2772

Closed
christian-storm opened this issue Sep 19, 2018 · 4 comments
Closed

parser not obeying is_sent_start == False (regression) #2772

christian-storm opened this issue Sep 19, 2018 · 4 comments
Labels
bug Bugs and behaviour differing from documentation feat / parser Feature: Dependency Parser 🌙 nightly Discussion and contributions related to nightly builds

Comments

@christian-storm
Copy link

Not only is spacy nightly not obeying is_sent_start it also is producing a bad sentence segmentation.

How to reproduce the behaviour

import spacy

text = 'When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.'


def sbd_component(doc):
    doc[0].is_sent_start = True
    for i, token in enumerate(doc[1:]):
        # define sentence start after space token
        if doc[i-1].is_space:
            doc[i].is_sent_start = True
        else:
            doc[i].is_sent_start = False
    return doc

# Bad builtin sbd
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, sent.text)

# Spacy isn't honoring is_sent_start and producing the same poor sbd
nlp.add_pipe(sbd_component, before='parser')  # insert before the parser
doc = nlp(text)
for i, sent in enumerate(doc.sents):
    print(i, sent.text)

I expected
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.
1 When we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us.

But get this instead
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us
1 When
2 we write or communicate virtually, we can hide our through feelings and many not become ourselves since we do not want the other party to judge us

When I run this on the binder (https://spacy.io/usage/processing-pipelines#component-example1) it works as expected.

Your Environment

  • spaCy version: 2.1.0a1
  • Platform: Darwin-17.7.0-x86_64-i386-64bit
  • Python version: 3.7.0
  • Models: en_core_web_md, es_core_news_md, en_core_web_lg, en_core_web_sm
@ines ines added 🌙 nightly Discussion and contributions related to nightly builds feat / parser Feature: Dependency Parser labels Sep 19, 2018
@honnibal
Copy link
Member

Thanks, I've been chasing this bug for a while on develop. I think it's occurring in set_heads_from_children. The parse and sentence boundaries actually disagree here.

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Sep 19, 2018
@honnibal
Copy link
Member

The issue arises when we have non-projective dependencies (aka crossing brackets). The parser is constrained to produce only projective trees, but there's a pre- and post- processing trick to make the parser predict non-projective analyses. After deprojectivisation, we run the set_children_from_heads routine, which was written with the assumption that the parse is projective --- but this assumption is no longer true, causing the error.

honnibal added a commit that referenced this issue Sep 19, 2018
The set_children_from_heads function assumed parse trees were
projective. However, non-projective parses may be passed in during
deserialization, or after deprojectivising. This caused incorrect
sentence boundaries to be set for non-projective parses. Close #2772.
@honnibal
Copy link
Member

Thanks again for the test case. Fixed now! This had held up the experiments on the universal dependencies corpus, as there are many more non-projective parses there.

@lock
Copy link

lock bot commented Oct 19, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Oct 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / parser Feature: Dependency Parser 🌙 nightly Discussion and contributions related to nightly builds
Projects
None yet
Development

No branches or pull requests

3 participants