Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc.to_json() method fails when called on doc generated via Span.as_doc() #3962

Closed
JohnStuartRutledge opened this issue Jul 13, 2019 · 3 comments
Assignees
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects

Comments

@JohnStuartRutledge
Copy link

JohnStuartRutledge commented Jul 13, 2019

How to reproduce the behaviour

It would appear that calling to_json() on a Doc that was created via the as_doc() method is prone to failure. This may very well be connected to other issues with with as_doc like the that mentioned in issue #3669

import spacy

nlp = spacy.load('en')
doc = nlp('He jests at scars, that never felt a wound.')
span = doc[0:3]      # "He jests at"
doc2 = span.as_doc()
doc2.to_json()
# IndexError: [E040] Attempt to access token at 7, max length 3.

The above error seems to trigger when second token of the span (jests), calls spacy.tokens.Token.head.__get__() and hits the self.c.head

Environment

  • spaCy version: 2.1.6
  • Platform: Darwin-18.6.0-x86_64-i386-64bit
  • Python version: 3.6.4
  • Models: en
@svlandeg svlandeg added feat / doc Feature: Doc, Span and Token objects more-info-needed This issue needs more information bug Bugs and behaviour differing from documentation and removed more-info-needed This issue needs more information labels Jul 15, 2019
@svlandeg
Copy link
Member

Yep, I can replicate this bug with en_core_web_md (not with en_core_web_lg) - looks like it's definitely connected to the trouble caused by the dependency parsing, as discussed in that other issue...

@svlandeg svlandeg added models Issues related to the statistical models and removed models Issues related to the statistical models labels Jul 15, 2019
@svlandeg svlandeg self-assigned this Jul 15, 2019
svlandeg added a commit to svlandeg/spaCy that referenced this issue Jul 15, 2019
@svlandeg
Copy link
Member

@JohnStuartRutledge : I think we can fix this, cf. #3969. Basically what happens in your example, is that the second word jests refers to felt (index: 7) as its head word in the dependency graph, but after the as_doc operation that index is out of bounds. So we need to prevent copying that kind of out-of-span information ...

honnibal pushed a commit that referenced this issue Jul 23, 2019
* failing unit test for issue 3962

* attempt to fix Issue #3962

* create artificial unit test example

* using length instead of self.length

* sp

* reformat with black

* find better ancestor within span and use generic 'dep'

* attach to span.root if there is no appropriate ancestor

* comment span text

* clean up ancestor code

* reconstruct dep tree to keep same number of sentences
@ines ines closed this as completed Jul 23, 2019
polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
* failing unit test for issue 3962

* attempt to fix Issue explosion#3962

* create artificial unit test example

* using length instead of self.length

* sp

* reformat with black

* find better ancestor within span and use generic 'dep'

* attach to span.root if there is no appropriate ancestor

* comment span text

* clean up ancestor code

* reconstruct dep tree to keep same number of sentences
@lock
Copy link

lock bot commented Aug 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 22, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / doc Feature: Doc, Span and Token objects
Projects
None yet
Development

No branches or pull requests

3 participants