Segmentation fault when using span.as_doc() method #3669
Comments
Thanks for the report. Could you try avoiding the … You might find the serialization code here useful: https://github.com/explosion/spaCy/blob/master/spacy/tokens/_serialize.py. This lets you serialize a collection of …
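For context, a minimal sketch of what serializing a collection of Docs into a single payload can look like. In recent spaCy releases the class for this is spacy.tokens.DocBin; the 2.1-era _serialize module linked above may expose a different name or API, so treat this as illustrative rather than the exact code the maintainer is pointing to:

```python
# Minimal sketch (not from the thread): serialize many Docs into one blob.
# Uses spacy.tokens.DocBin, available in recent spaCy releases; the 2.1-era
# _serialize module referenced above may differ.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

doc_bin = DocBin(store_user_data=True)
for text in ["First comment.", "Second comment."]:
    doc_bin.add(nlp(text))

data = doc_bin.to_bytes()  # one bytes payload for the whole collection

# Later / elsewhere: restore the Docs against the same vocab.
docs = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```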
Thanks for the response! In my own testing, the code ran fine without the …
I confirm there is something happening with the … I used to do …
Also running into this issue. Similar to @thomasopsomer, we're trying to run a …
Did some more investigation here. It seems that the issue only shows itself when the …

EDIT: Did some more digging. Made a replica of …
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
I am trying to parse a number of reddit comments, dumped from pushshift.io. A file with a sample of these comments (about 130,000) can be found here: comment_sample_100k.csv.
The following script is a heavily shortened version of the one I'm using to do the parsing: spacy_failure.py. There are two custom components in the pipeline: the tokenizer and the sentence boundary setter. The script builds the custom Language object, reads lines in csv format from stdin, parses them, and throws away the result. Running the script looks like:
$ python spacy_failure.py < comment_sample_100k.csv
More than half the time, this command will eventually segfault. It's nondeterministic - sometimes it crashes in 30 seconds, other times it runs for 10 minutes, and sometimes it finishes successfully.
This is about as short as I could make the script while reproducing the error. Lots of seemingly unrelated lines will prevent the error from occurring. Certain small details that prevent the error are marked with comments. Certain other lines that aren't critical to the error but seem to increase the crash rate are also marked. Occasionally, instead of a segfault, the CSV reader will be corrupted and raise an exception.
My best guess is that the issue is related to resource cleanup of the object returned by sent.as_doc().to_bytes(). Changing where the Sentence object gets garbage collected seems to change the outcome. I'm using spaCy 2.1.0, but I've replicated it on 2.1.3 as well.
Info about spaCy