
Segmentation fault when using span.as_doc() method #3669

Closed
apuranik1 opened this issue May 3, 2019 · 7 comments
Labels: bug (Bugs and behaviour differing from documentation), feat / doc (Feature: Doc, Span and Token objects)


@apuranik1

How to reproduce the behaviour

I am trying to parse a number of reddit comments, dumped from pushshift.io. A file with a sample of these comments (about 130,000) can be found here: comment_sample_100k.csv.

The following script is a heavily shortened version of the one I'm using to do the parsing: spacy_failure.py. There are two custom components in the pipeline: the tokenizer and the sentence boundary setter. The script builds the custom Language object, reads lines in csv format from stdin, parses them, and throws away the result. Running the script looks like:

$ python spacy_failure.py < comment_sample_100k.csv

More than half the time, this command will eventually segfault. It's nondeterministic - sometimes it crashes in 30 seconds, other times it runs for 10 minutes, and sometimes it finishes successfully.

This is about as short as I could make the script while still reproducing the error: removing lots of seemingly unrelated lines will prevent the error from occurring. Small details that prevent the error are marked with comments, as are other lines that aren't critical to the error but seem to increase the crash rate. Occasionally, instead of a segfault, the CSV reader will be corrupted and raise an exception.

My best guess is that the issue is related to resource cleanup of the object returned by sent.as_doc().to_bytes(). Changing where the sentence object gets garbage collected seems to change the outcome.
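A minimal sketch of the pattern described above (this is not the linked spacy_failure.py; the model name, CSV column, and custom components are placeholders):

```python
# Sketch only: the real script builds a custom Language with its own tokenizer
# and sentence-boundary component; those details are elided here.
import csv
import sys

import spacy

nlp = spacy.load("en")  # stand-in for the custom Language object

for row in csv.reader(sys.stdin):
    doc = nlp(row[-1])                    # parse the comment text (column index is a guess)
    for sent in doc.sents:
        data = sent.as_doc().to_bytes()   # suspected source of the segfault
        # the serialized bytes are thrown away
```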

I'm using spaCy 2.1.0, but I've replicated it on 2.1.3 as well.

Info about spaCy

  • spaCy version: 2.1.0
  • Platform: Darwin-18.2.0-x86_64-i386-64bit
  • Python version: 3.7.2
  • Models: en
ines added the bug and feat / doc labels May 3, 2019
honnibal (Member) commented May 3, 2019

Thanks for the report.

Could you try avoiding the span.as_doc() call? I've had trouble with this before, and it doesn't work the way it was originally intended. Originally I wanted it to be a zero-copy operation, but that didn't work out. I'm very suspicious that this could be where the bug is, as it's a fairly untested method that has had bugs previously.

You might find the serialization code here useful: https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py. This lets you serialize a collection of Doc objects in an efficient format. It should then let you reconstruct your sentences, given the Doc objects.
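A rough sketch of the batched-serialization approach being suggested. DocBin is the public name this module grew into from spaCy 2.2 onwards; earlier releases only ship the private _serialize module, so the exact import below is an assumption about the version in use:

```python
import spacy
from spacy.tokens import DocBin  # public wrapper around tokens/_serialize.py (spaCy 2.2+)

nlp = spacy.load("en_core_web_sm")
texts = ["First comment to parse.", "Second comment to parse."]

# Serialize a batch of Docs into one compact byte string.
doc_bin = DocBin(attrs=["HEAD", "DEP", "TAG", "ENT_IOB", "ENT_TYPE"])
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
data = doc_bin.to_bytes()

# Later: rebuild the Docs and recover their sentences from the stored parses.
for doc in DocBin().from_bytes(data).get_docs(nlp.vocab):
    for sent in doc.sents:
        print(sent.text)
```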

@apuranik1 (Author)

Thanks for the response! In my own testing the code ran fine without the as_doc call, so I'm glad to hear that's the most likely source of the issue. I can work around calling that method.

honnibal changed the title from "Segmentation fault when parsing documents with custom tokenization and sentence boundaries" to "Segmentation fault when using span.as_doc() method" May 11, 2019
thomasopsomer (Contributor) commented Jun 8, 2019

I can confirm there is something going on with the as_doc method :/ The problem seems to be in the Span.to_array method.

I used to call model(span.as_doc()) to apply a model like NER or TextCategorizer to a span, but it seems to be breaking now. What would be the best way to apply a model to a span?
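One possible workaround, sketched below under the assumption that re-tokenizing the span's raw text is acceptable (the new tokens may not line up exactly with the original span's tokens):

```python
# Sketch of a workaround: apply one pipeline component to a fresh Doc built
# from the span text, instead of calling span.as_doc(). `nlp` and `doc` are
# assumed to exist from earlier processing.
span = doc[2:7]                       # some Span of interest
span_doc = nlp.make_doc(span.text)    # re-tokenizes the raw text
ner = nlp.get_pipe("ner")
span_doc = ner(span_doc)              # run just the NER component on the new Doc
print([(ent.text, ent.label_) for ent in span_doc.ents])
```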

dpraul commented Jun 26, 2019

Also running into this issue. Similar to @thomasopsomer, we're trying to run a Matcher on each entity in doc.ents to do some extra processing.
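A rough sketch of that pattern (the match pattern and label here are placeholders; in spaCy 2.x the Matcher operates on a Doc, hence the as_doc() conversion of each entity):

```python
from spacy.matcher import Matcher

# Placeholder pattern standing in for the "extra processing" step.
matcher = Matcher(nlp.vocab)
matcher.add("EXTRA_PROCESSING", None, [{"IS_TITLE": True}])  # spaCy 2.x add() signature

for ent in doc.ents:
    matches = matcher(ent.as_doc())  # intermittently segfaults while the parser is enabled
```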

dpraul commented Jun 27, 2019

Did some more investigation here. It seems that the issue only shows itself when the DependencyParser component is enabled in the pipeline. With it disabled, we aren't able to reproduce the segfault.
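For reference, a sketch of that check using the spaCy 2.x API (texts is a placeholder for the input):

```python
# With the dependency parser disabled, the as_doc() calls no longer segfault.
with nlp.disable_pipes("parser"):
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            ent.as_doc()
```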

EDIT: Did some more digging. I made a replica of Span.as_doc(), removed HEAD from the list of attrs, and the segfaults don't happen!
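Roughly the following, where the exact attribute selection is an approximation rather than the real Span.as_doc() internals:

```python
from spacy.attrs import DEP, ENT_IOB, ENT_TYPE, LEMMA, POS, TAG
from spacy.tokens import Doc

def span_as_doc_without_head(span):
    """Copy a Span's tokens and attributes into a new Doc, skipping HEAD."""
    attrs = [LEMMA, POS, TAG, DEP, ENT_IOB, ENT_TYPE]   # note: no HEAD
    words = [token.text for token in span]
    spaces = [bool(token.whitespace_) for token in span]
    new_doc = Doc(span.doc.vocab, words=words, spaces=spaces)
    new_doc.from_array(attrs, span.to_array(attrs))
    return new_doc
```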

svlandeg (Member) commented Jul 15, 2019

@dpraul: yep, you're right. I tried fixing this in #3969: instead of simply removing all head information from the span (doc), we can keep only the heads that refer to tokens inside the span.
So I am hopeful that this PR will fix your issue (and the others above), too.
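In other words, something along these lines (an illustration of the idea, not the actual patch in #3969):

```python
def adjusted_heads(span):
    """Keep a token's head only if it falls inside the span; otherwise the
    token becomes its own head."""
    heads = []
    for token in span:
        if span.start <= token.head.i < span.end:
            heads.append(token.head.i - span.start)  # re-index relative to the span
        else:
            heads.append(token.i - span.start)       # head outside the span -> attach to self
    return heads
```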

ines closed this as completed Jul 23, 2019
lock bot commented Aug 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Aug 22, 2019