Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PhraseMatcher.remove throws an error if there are duplicates in the list of phrases #4435

Closed
slartibaartfast opened this issue Oct 13, 2019 · 4 comments · Fixed by #4437
Closed
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher

Comments

@slartibaartfast
Copy link

How to reproduce the behaviour

Removing phrases from the PhraseMatcher causes an error when one of the phrases exists as part of another phrase. Sometimes, with long lists of phrases, it causes a segmentation fault. This snippet might reproduce the error (it does for me):

import spacy
from spacy.matcher import PhraseMatcher, Matcher

nlp = spacy.load("en_core_web_lg")
phrase_matcher = PhraseMatcher(nlp.vocab)
phrases = ["this is a pig", "this is a sheep", "this is a", "this is a dog"]
patterns = [nlp(phrase) for phrase in phrases]
phrase_matcher.add("animals", None, *patterns)
print("removing animals")
phrase_matcher.remove("animals")

will throw this error:

removing animals
Traceback (most recent call last):
  File "spacytest.py", line 10, in <module>
    phrase_matcher.remove("animals")
  File "phrasematcher.pyx", line 136, in spacy.matcher.phrasematcher.PhraseMatcher.remove
  File "cymem.pyx", line 102, in cymem.cymem.Pool.free
KeyError: 140393018962904

Your Environment

  • spaCy version: 2.2.1
  • Platform: Linux-5.0.0-31-generic-x86_64-with-Ubuntu-19.04-disco
  • Python version: 3.7.3
@svlandeg svlandeg added bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher labels Oct 13, 2019
@svlandeg
Copy link
Member

Thanks for the report and the code example! Can confirm the erratic behaviour - this certainly looks like a bug. We'll look into it.

@slartibaartfast
Copy link
Author

My pleasure. I'm glad you are looking into it.

I'm trying to workaround it, but end up manually making a second list of the purged phrases. This will purge the duplicates...

import re
def find_duplicate_phrases(phrases):
    """
    Search a list of phrases for duplicate sub strings and return a clean list
    """
    for outer_phrase in phrases:
        for inner_phrase in phrases:
            if re.search(outer_phrase, inner_phrase ) is not None:
                if inner_phrase != outer_phrase:
                    # get the shortest, least specific phrase
                    if len(inner_phrase) < len(outer_phrase):
                        if inner_phrase in phrases:
                            index = phrases.index(inner_phrase)
                            del phrases[index]
                    elif len(outer_phrase) < len(inner_phrase):
                        if outer_phrase in phrases:
                            index = phrases.index(outer_phrase)
                            del phrases[index]
                    else:
                        index = phrases.index(outer_phrase)
                        del phrases[index]
    return phrases

@adrianeboyd
Copy link
Contributor

I'm not sure there's a good workaround even if you filter the phrases a bit, since there were a couple things wrong with how it removed multiple phrases for the same match ID. The only (still rather hacky) workaround that I think could work would be to add each phrase with a separate match ID (like "ANIMAL_sheep" or "ANIMAL_dog") and filter/reduce the match IDs after matching.

Thanks again for the bug report! Hopefully this patch will fix the problem for the next minor release.

@lock
Copy link

lock bot commented Nov 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Nov 13, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants