
Checking sentence attributes of Tokens of unit test Doc causes a segfault sometimes #5075

Closed
swartchris8 opened this issue Mar 1, 2020 · 9 comments
Labels
feat / doc Feature: Doc, Span and Token objects

Comments


swartchris8 commented Mar 1, 2020

I came across this issue while trying to write a complex spaCy Doc-based unit test. Checking the sent attribute of the tokens of the doc created below sometimes causes a segfault. In my legacy spacy==2.0.12 dev environment it segfaults every time; on the newer 2.2.3 it only segfaults sometimes.

How to reproduce the behaviour

Run the script below 4-10 times and you should get a segfault:

from spacy.tests.util import get_doc
from spacy.vocab import Vocab

# "A simple method is used to compare the efficacy of ropinirole and rasagiline for Parkinson's disease."
words = ["A", "simple method"] + "is used to compare the efficacy of ropinirole and rasagiline for".split() + ["Parkinson's disease"]
vocab_words = """"A simple method is used to compare the efficacy of ropinirole and rasagiline for Parkinson's disease""".split()
vocab = Vocab(strings=vocab_words)
dependencies = [
    "det",  # A
    "nsubjpass",  # simple method
    "auxpass",  # is
    "ROOT",  # used
    "aux",  # to
    "xcomp",  # compare
    "det",  # the
    "dobj",  # efficacy
    "prep",  # of
    "pobj",  # ropinirole
    "cc",  # and
    "conj",  # rasagiline
    "prep",  # for
    "pobj",  # Parkinson's disease
]
tags = ["DT", "JJ", "VBZ", "VBN", "TO", "VB", "DT", "NN", "IN", "NN", "CC", "NN", "IN", "NNP"]
pos_tags = [
    "DET", "ADJ", "VERB", "VERB", "PART", "VERB", "DET", "NOUN", "ADP", "NOUN", "CCONJ", "NOUN", "ADP",
    "PROPN"]
heads = [1, 3, 3, 3, 5, 3, 7, 5, 7, 8, 9, 9, 9, 12]

en_doc = get_doc(vocab, words, pos_tags, heads, dependencies, tags)

for t in en_doc:
    print(t, t.sent)

Your Environment

  • spaCy version: 2.2.3
  • Platform: Darwin-19.3.0-x86_64-i386-64bit
  • Python version: 3.6.8
  • Environment Information: Clean virtualenv with only spaCy installed, via pip install spacy
adrianeboyd added the feat / doc Feature: Doc, Span and Token objects label Mar 1, 2020

adrianeboyd commented Mar 1, 2020

Thanks for the report! I strongly suspect that get_doc() is the source of the problem rather than Token.sent. You could try constructing your doc with Doc rather than get_doc() and see if you run into the same problems. A sketch:

from spacy.tokens import Doc

doc = Doc(vocab, words=words)  # add `spaces` if you need it

# copy the per-token annotations onto the constructed doc
for t, pos, tag, head, dep in zip(doc, pos_tags, tags, heads, dependencies):
    t.tag_ = tag
    t.pos_ = pos
    t.head = doc[head]
    t.dep_ = dep

# mark the annotation as complete
doc.is_tagged = True
doc.is_parsed = True

for t in doc:
    print(t, t.sent)
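
Note that Token.head is assigned a Token object here (doc[head]), so the absolute indices from your heads list work as-is. Setting doc.is_parsed marks the parse as complete, which is what lets Token.sent read sentence boundaries off it (and doc.is_tagged does the same for the tag attributes).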

tests.util.get_doc() is a bit buggy in various ways, and @svlandeg has just been working on improving it in #5049. That work wasn't prompted by segfaults directly (although I've seen the same kind of segfaults), but I think her modifications should address these problems, too.


svlandeg commented Mar 2, 2020

@adrianeboyd is probably right that get_doc is the culprit here. Unfortunately, it doesn't look like the changes from PR #5049 fix this particular issue :(

When I try to run the original code, I actually get into an infinite loop, which is fixed with this edit (which I think should have been there all along?). But after that small fix, and with the changes of PR #5049 merged, I still run into a segfault :( So we'll have to revise get_doc further, I assume?

adrianeboyd commented

An infinite loop was not the intention there, no. I had made some local changes to get_doc() a while back to address segfaults, but I thought they mainly had to do with attribute arrays of differing lengths, which @svlandeg's code shouldn't allow, so maybe I had just fixed it enough to get the tests to pass where tokenizer modifications cause problems, without identifying the real issue. I will look into it again...

adrianeboyd commented

I suspect the problem is with the heads. Doc.from_array() expects relative heads (offsets from each token, so e.g. the ROOT's head is 0) rather than absolute token indices. This should be detected and caught better.


svlandeg commented Mar 2, 2020

Ah, good point! Are you looking into this further, or do you want me to?


adrianeboyd commented Mar 2, 2020

Adjust the heads:

heads = [1, 3, 3, 3, 5, 3, 7, 5, 7, 8, 9, 9, 9, 12]
# convert absolute token indices to offsets relative to each token
for i, head in enumerate(heads):
    heads[i] = head - i

adrianeboyd commented

I can do it. Maybe it's a chance to figure out some of the old bugs with set_children_from_heads(), too.


svlandeg commented Mar 3, 2020

Fixed by PR #5079, which now throws an error pointing out that the heads need to be relative. In this specific case, the correct heads are:

heads = [1, 2, 1, 0, 1, -2, 1, -2, -1, -1, -1, -2, -3, -1]
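
With those relative heads the original reproduction should run cleanly; a minimal sketch, assuming the same variables and get_doc call as in the script at the top of this issue:

heads = [1, 2, 1, 0, 1, -2, 1, -2, -1, -1, -1, -2, -3, -1]
en_doc = get_doc(vocab, words, pos_tags, heads, dependencies, tags)

for t in en_doc:
    print(t, t.sent)  # each token with its containing sentence, no segfault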

svlandeg closed this as completed Mar 3, 2020

lock bot commented Apr 2, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Apr 2, 2020