
Segmentation fault when using span.as_doc() method #3669

Closed
apuranik1 opened this issue May 3, 2019 · 7 comments
Labels: bug (Bugs and behaviour differing from documentation), feat / doc (Feature: Doc, Span and Token objects)


@apuranik1

How to reproduce the behaviour

I am trying to parse a number of reddit comments, dumped from pushshift.io. A file with a sample of these comments (about 130,000) can be found here: comment_sample_100k.csv.

The following script is a heavily shortened version of the one I'm using to do the parsing: spacy_failure.py. There are two custom components in the pipeline: the tokenizer and the sentence boundary setter. The script builds the custom Language object, reads lines in csv format from stdin, parses them, and throws away the result. Running the script looks like:

$ python spacy_failure.py < comment_sample_100k.csv

More than half the time, this command will eventually segfault. It's nondeterministic - sometimes it crashes in 30 seconds, other times it runs for 10 minutes, and sometimes it finishes successfully.

This is about as short as I could make the script while still reproducing the error: removing lots of seemingly unrelated lines will prevent the error from occurring. Small details that prevent the error are marked with comments, as are other lines that aren't critical to the error but seem to increase the crash rate. Occasionally, instead of a segfault, the CSV reader will be corrupted and raise an exception.

My best guess is that the issue is related to resource cleanup of the object returned by sent.as_doc().to_bytes(). Changing where the sentence object gets garbage collected seems to change the outcome.
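A minimal sketch of the pattern described above (this is not the linked spacy_failure.py; the model name, CSV column, and custom components are placeholders):

```python
# Sketch only: the real script builds a custom Language with its own tokenizer
# and sentence-boundary component; those details are elided here.
import csv
import sys

import spacy

nlp = spacy.load("en")  # stand-in for the custom Language object

for row in csv.reader(sys.stdin):
    doc = nlp(row[-1])                    # parse the comment text (column index is a guess)
    for sent in doc.sents:
        data = sent.as_doc().to_bytes()   # suspected source of the segfault
        # the serialized bytes are thrown away
```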

I'm using spaCy 2.1.0, but I've replicated it on 2.1.3 as well.

Info about spaCy

  • spaCy version: 2.1.0
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.7.2
  • Models: en
ines added the bug and feat / doc labels May 3, 2019
honnibal (Member) commented May 3, 2019

Thanks for the report.

Could you try avoiding the span.as_doc() call? I've had trouble with this before, and it doesn't work the way it was originally intended. Originally I wanted it to be a zero-copy operation, but that didn't work out. I'm very suspicious that this could be where the bug is, as it's a fairly untested method that has had bugs previously.

You might find the serialization code here useful: https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py. This lets you serialize a collection of Doc objects in an efficient format. It should then let you reconstruct your sentences, given the Doc objects.
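A rough sketch of the batched-serialization approach being suggested. DocBin is the public name this module grew into from spaCy 2.2 onwards; earlier releases only ship the private _serialize module, so the exact import below is an assumption about the version in use:

```python
import spacy
from spacy.tokens import DocBin  # public wrapper around tokens/_serialize.py (spaCy 2.2+)

nlp = spacy.load("en_core_web_sm")
texts = ["First comment to parse.", "Second comment to parse."]

# Serialize a batch of Docs into one compact byte string.
doc_bin = DocBin(attrs=["HEAD", "DEP", "TAG", "ENT_IOB", "ENT_TYPE"])
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
data = doc_bin.to_bytes()

# Later: rebuild the Docs and recover their sentences from the stored parses.
for doc in DocBin().from_bytes(data).get_docs(nlp.vocab):
    for sent in doc.sents:
        print(sent.text)
```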

@apuranik1 (Author)

Thanks for the response! In my own testing the code ran fine without the as_doc call, so I'm glad to hear that's the most likely source of the issue. I can work around calling that method.

honnibal changed the title from "Segmentation fault when parsing documents with custom tokenization and sentence boundaries" to "Segmentation fault when using span.as_doc() method" May 11, 2019
thomasopsomer (Contributor) commented Jun 8, 2019

I can confirm there is something going on with the as_doc method :/ The problem seems to be in the Span.to_array method.

I used to call model(span.as_doc()) to apply a model like NER or TextCategorizer to a span, but it seems to be breaking now. What would be the best way to apply a model to a span?
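One possible workaround, sketched below under the assumption that re-tokenizing the span's raw text is acceptable (the new tokens may not line up exactly with the original span's tokens):

```python
# Sketch of a workaround: apply one pipeline component to a fresh Doc built
# from the span text, instead of calling span.as_doc(). `nlp` and `doc` are
# assumed to exist from earlier processing.
span = doc[2:7]                       # some Span of interest
span_doc = nlp.make_doc(span.text)    # re-tokenizes the raw text
ner = nlp.get_pipe("ner")
span_doc = ner(span_doc)              # run just the NER component on the new Doc
print([(ent.text, ent.label_) for ent in span_doc.ents])
```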

dpraul commented Jun 26, 2019

Also running into this issue. Similar to @thomasopsomer, we're trying to run a Matcher on each entity in doc.ents to do some extra processing.
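A rough sketch of that pattern (the match pattern and label here are placeholders; in spaCy 2.x the Matcher operates on a Doc, hence the as_doc() conversion of each entity):

```python
from spacy.matcher import Matcher

# Placeholder pattern standing in for the "extra processing" step.
matcher = Matcher(nlp.vocab)
matcher.add("EXTRA_PROCESSING", None, [{"IS_TITLE": True}])  # spaCy 2.x add() signature

for ent in doc.ents:
    matches = matcher(ent.as_doc())  # intermittently segfaults while the parser is enabled
```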

dpraul commented Jun 27, 2019

Did some more investigation here. It seems that the issue only shows itself when the DependencyParser component is enabled in the pipeline. With it disabled, we aren't able to reproduce the segfault.
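For reference, a sketch of that check using the spaCy 2.x API (texts is a placeholder for the input):

```python
# With the dependency parser disabled, the as_doc() calls no longer segfault.
with nlp.disable_pipes("parser"):
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            ent.as_doc()
```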

EDIT: Did some more digging. I made a replica of Span.as_doc(), removed HEAD from the list of attrs, and the segfaults don't happen!
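Roughly the following, where the exact attribute selection is an approximation rather than the real Span.as_doc() internals:

```python
from spacy.attrs import DEP, ENT_IOB, ENT_TYPE, LEMMA, POS, TAG
from spacy.tokens import Doc

def span_as_doc_without_head(span):
    """Copy a Span's tokens and attributes into a new Doc, skipping HEAD."""
    attrs = [LEMMA, POS, TAG, DEP, ENT_IOB, ENT_TYPE]   # note: no HEAD
    words = [token.text for token in span]
    spaces = [bool(token.whitespace_) for token in span]
    new_doc = Doc(span.doc.vocab, words=words, spaces=spaces)
    new_doc.from_array(attrs, span.to_array(attrs))
    return new_doc
```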

svlandeg (Member) commented Jul 15, 2019

@dpraul: yep, you're right. I tried fixing this in #3969: instead of simply removing all head information from the span (doc), we can keep only the heads that refer to tokens inside the span.
So I am hopeful that this PR will fix your issue (and the others above), too.
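In other words, something along these lines (an illustration of the idea, not the actual patch in #3969):

```python
def adjusted_heads(span):
    """Keep a token's head only if it falls inside the span; otherwise the
    token becomes its own head."""
    heads = []
    for token in span:
        if span.start <= token.head.i < span.end:
            heads.append(token.head.i - span.start)  # re-index relative to the span
        else:
            heads.append(token.i - span.start)       # head outside the span -> attach to self
    return heads
```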

ines closed this as completed Jul 23, 2019
lock bot commented Aug 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Aug 22, 2019