Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False positive matches on subsequent OPs in the Matcher patterns #3879

Closed
marina-sp opened this issue Jun 25, 2019 · 5 comments
Closed

False positive matches on subsequent OPs in the Matcher patterns #3879

marina-sp opened this issue Jun 25, 2019 · 5 comments
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher

Comments

@marina-sp
Copy link

Hello, I'm working with spacy a lot, and I especially like the spacy.matchers! But I think I found something that is not working properly (yet).

If a pattern contains a pattern element with a specified OP followed by another optional pattern element with OP=?, then all other restrictions one the first element (like ORTH or REGEX) are ignored.

How to reproduce


import spacy
import de_core_news_sm
from spacy.matcher import Matcher

nlp = de_core_news_sm.load()

text = 'Das ist ein Test.'
doc = nlp(text)
assert(len(doc) == 5)

pattern1 = [
    {'ORTH': 'Das', 'OP': '?'}, 
    {'OP': '?'},
    {'ORTH': 'Test'}
]
matcher = Matcher(nlp.vocab)
matcher.add('rule1', None, pattern1)
matches = matcher(doc) 
assert(len(matches) == 2)   # fails because of a FP match 'ist ein Test'
False positives appear on other OPs as well:
pattern2 = [
    {'ORTH': 'Das', 'OP': '+'}, 
    {'OP': '?'},
    {'ORTH': 'Test'}
]
matcher = Matcher(nlp.vocab)
matcher.add('rule1', None, pattern2)
matches = matcher(doc) 
assert(len(matches) == 0)   # fails because of a FP match 'Das ist ein Test'
pattern3 = [
    {'ORTH': 'Das', 'OP': '*'}, 
    {'OP': '?'},
    {'ORTH': 'Test'}
]
matcher = Matcher(nlp.vocab)
matcher.add('rule1', None, pattern3)
matches = matcher(doc) 
assert(len(matches) == 2)   # fails because of a 2 FP matches 'Das ist ein Test' and 'ist ein Test'

The ? operator seems to respect ORTH restrictions normally, when there is only a single operator:

pattern4 = [
    {'ORTH': 'Das', 'OP': '?'}, 
    {'ORTH': 'Test'}
]
matcher = Matcher(nlp.vocab)
matcher.add('rule1', None, pattern4)
matches = matcher(doc) 
assert(len(matches) == 1) 

My Environment

  • spaCy version: 2.1.4
  • Platform: Windows-10-10.0.17763-SP0
  • Python version: 3.7.3

I hope this is helpful and you will be able to look into it. Thank you in advance!

@ines ines added bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher labels Jun 26, 2019
@honnibal
Copy link
Member

Thanks for the report. I think this is the same bug as #3839. Merging this with that issue.

@paulrinckens
Copy link

It seems as the fix that fixed #3839, did not fix this issue.
The above snippet still fails with v2.1.6

svlandeg added a commit to svlandeg/spaCy that referenced this issue Aug 8, 2019
@svlandeg
Copy link
Member

svlandeg commented Aug 8, 2019

@paulrinckens : you're right, the matcher still produces one false positive match by misinterpreting the first token of the pattern. Reopening and wrote a unit test with @marina-sp's example.

@svlandeg svlandeg reopened this Aug 8, 2019
ines pushed a commit that referenced this issue Aug 20, 2019
* failing unit test for Issue #3879

* mark test as failing
@ines
Copy link
Member

ines commented Aug 20, 2019

Merging this with #4154!

@ines ines closed this as completed Aug 20, 2019
@lock
Copy link

lock bot commented Sep 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher
Projects
None yet
Development

No branches or pull requests

5 participants