
Checking sentence attributes of Tokens of unit test Doc causes a segfault sometimes #5075

Closed
swartchris8 opened this issue Mar 1, 2020 · 9 comments
Labels
feat / doc Feature: Doc, Span and Token objects

Comments


swartchris8 commented Mar 1, 2020

I came across this issue while trying to write a complex spaCy Doc-based unit test. Checking the sent attribute of the tokens of the doc created below sometimes causes a segfault. In my legacy spacy==2.0.12 dev environment it segfaults every time; on the newer 2.2.3 it only segfaults sometimes.

How to reproduce the behaviour

Run the script below 4-10 times and you should get a segfault:

from spacy.tests.util import get_doc
from spacy.vocab import Vocab

# "A simple method is used to compare the efficacy of ropinirole and rasagiline for Parkinson's disease."
words = ["A", "simple method"] + "is used to compare the efficacy of ropinirole and rasagiline for".split() + ["Parkinson's disease"]
vocab_words = """"A simple method is used to compare the efficacy of ropinirole and rasagiline for Parkinson's disease""".split()
vocab = Vocab(strings=vocab_words)
dependencies = [
    "det",  # A
    "nsubjpass",  # simple method
    "auxpass",  # is
    "ROOT",  # used
    "aux",  # to
    "xcomp",  # compare
    "det",  # the
    "dobj",  # efficacy
    "prep",  # of
    "pobj",  # ropinirole
    "cc",  # and
    "conj",  # rasagiline
    "prep",  # for
    "pobj",  # Parkinson's disease
]
tags = ["DT", "JJ", "VBZ", "VBN", "TO", "VB", "DT", "NN", "IN", "NN", "CC", "NN", "IN", "NNP"]
pos_tags = [
    "DET", "ADJ", "VERB", "VERB", "PART", "VERB", "DET", "NOUN", "ADP", "NOUN", "CCONJ", "NOUN", "ADP",
    "PROPN"]
heads = [1, 3, 3, 3, 5, 3, 7, 5, 7, 8, 9, 9, 9, 12]

en_doc = get_doc(vocab, words, pos_tags, heads, dependencies, tags)

for t in en_doc:
    print(t, t.sent)

Your Environment

  • spaCy version: 2.2.3
  • Platform: Darwin-19.3.0-x86_64-i386-64bit
  • Python version: 3.6.8
  • Environment Information: Clean virtualenv with only spaCy installed, via pip install spacy
adrianeboyd added the feat / doc Feature: Doc, Span and Token objects label Mar 1, 2020

adrianeboyd commented Mar 1, 2020

Thanks for the report! I strongly suspect that get_doc() is the source of the problem rather than Token.sent. You could try constructing your doc with Doc rather than get_doc() and see if you run into the same problems. A sketch:

from spacy.tokens import Doc

doc = Doc(vocab, words=words)  # add `spaces` if you need it

# copy the per-token annotations onto the constructed doc
for t, pos, tag, head, dep in zip(doc, pos_tags, tags, heads, dependencies):
    t.tag_ = tag
    t.pos_ = pos
    t.head = doc[head]
    t.dep_ = dep

# mark the annotation as complete
doc.is_tagged = True
doc.is_parsed = True

for t in doc:
    print(t, t.sent)
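
Note that Token.head is assigned a Token object here (doc[head]), so the absolute indices from your heads list work as-is. Setting doc.is_parsed marks the parse as complete, which is what lets Token.sent read sentence boundaries off it (and doc.is_tagged does the same for the tag attributes).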

tests.util.get_doc() is a bit buggy in various ways, and @svlandeg has just been working on improving it in #5049. That work wasn't prompted by segfaults directly (although I've seen the same kind of segfaults), but I think her modifications should address these problems, too.


svlandeg commented Mar 2, 2020

@adrianeboyd is probably right that get_doc is the culprit here. Unfortunately, it doesn't look like the changes from PR #5049 fix this particular issue :(

When I try to run the original code, I actually get into an infinite loop, which is fixed with this edit (which I think should have been there all along?). But after that small fix, and with the changes of PR #5049 merged, I still run into a segfault :( So we'll have to revise get_doc further, I assume?

adrianeboyd commented

An infinite loop was not the intention there, no. I had made some local changes to get_doc() a while back to address segfaults, but I thought they mainly had to do with attribute arrays of differing lengths, which @svlandeg's code shouldn't allow, so maybe I had just fixed it enough to get the tests to pass where tokenizer modifications cause problems, without identifying the real issue. I will look into it again...

adrianeboyd commented

I suspect the problem is with the heads. Doc.from_array() expects relative heads (offsets from each token, so e.g. the ROOT's head is 0) rather than absolute token indices. This should be detected and caught better.


svlandeg commented Mar 2, 2020

Ah, good point! Are you looking into this further, or do you want me to?


adrianeboyd commented Mar 2, 2020

Adjust the heads:

heads = [1, 3, 3, 3, 5, 3, 7, 5, 7, 8, 9, 9, 9, 12]
# convert absolute token indices to offsets relative to each token
for i, head in enumerate(heads):
    heads[i] = head - i

adrianeboyd commented

I can do it. Maybe it's a chance to figure out some of the old bugs with set_children_from_heads(), too.


svlandeg commented Mar 3, 2020

Fixed by PR #5079, which now throws an error pointing out that the heads need to be relative. In this specific case, the correct heads are:

heads = [1, 2, 1, 0, 1, -2, 1, -2, -1, -1, -1, -2, -3, -1]
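
With those relative heads the original reproduction should run cleanly; a minimal sketch, assuming the same variables and get_doc call as in the script at the top of this issue:

heads = [1, 2, 1, 0, 1, -2, 1, -2, -1, -1, -1, -2, -3, -1]
en_doc = get_doc(vocab, words, pos_tags, heads, dependencies, tags)

for t in en_doc:
    print(t, t.sent)  # each token with its containing sentence, no segfault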

svlandeg closed this as completed Mar 3, 2020

lock bot commented Apr 2, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Apr 2, 2020