Matcher behavior with * quantifier #3009

mehmetilker · 2018-12-04T19:25:00Z

How to reproduce the behaviour

When I run the following code I expect no match, but I get "have probably done things we look", which probably comes from the first rule.
If I move the first rule to the end, I see no match, which is the expected behaviour.

The problem is similar to this one: #2005
According to that issue I shouldn't see any match with version 2.1.0, but the same problem occurs there as well as with version 2.0.18.

I have tried to use matcher2 as described here: #1971 (comment)
But I got: ModuleNotFoundError: No module named 'spacy.matcher2'

Did I misinterpret something, or is this another case of the * quantifier that needs to be handled?

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

matcher.add('1', None, *[[{'LEMMA': 'have'}, {'TAG': 'DT', 'OP': '?'}, {'TAG': 'PRP$', 'OP': '?'}, {'LOWER': 'look'}]])

matcher.add('2', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'in'}, {'LOWER': 'mind'}]])
matcher.add('3', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'it'}, {'LOWER': 'away'}]])
matcher.add('4', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'it'}, {'LOWER': 'coming'}]])
matcher.add('5', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'it'}, {'LOWER': 'off'}]])
matcher.add('6', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'the'}, {'LOWER': 'best'}]])


doc = nlp(
    "And people generally in high school, I think all of us have probably done things we look back on in high school and regret or cringe a bit."
)
matches = matcher(doc)
print('\n\n')
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]  # the matched span
    print(match_id, string_id, start, end, span.text)

Info about spaCy

spaCy version: 2.1.0a3
Platform: Windows-10-10.0.17134-SP0
Python version: 3.6.5
Models: en_core_web_sm

mehmetilker · 2018-12-04T19:29:30Z

Another case here (I think it is related), similar or identical to this one:
#2464

All three rules should match, but only the first and second do.
If I change 'IS_ASCII': True to False, the third rule matches as well.
The result is the same with 2.0.18 and 2.1.0.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

matcher.add('1', None, *[[{'LEMMA': 'have'},
                          {'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])

matcher.add('2', None, *[[{'LEMMA': 'have'},
                          {'IS_ASCII': True, 'IS_PUNCT': False, 'OP': '*'},
                          {'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])

matcher.add('3', None, *[[{'LEMMA': 'have'},
                          {'IS_ASCII': True, 'IS_PUNCT': False, 'OP': '?'},
                          {'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])

doc = nlp(
    "Some of it also has to do with rising US interest rates, a stronger dollar, and a firm economy that's supporting earnings growth."
)
matches = matcher(doc)
print('\n\n')
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]  # the matched span
    print(match_id, string_id, span.text)

honnibal · 2018-12-06T14:34:02Z

Thanks, definitely seems like a bug.

The ? quantifier indicates a token may occur zero or one times. If the token pattern fit, the matcher would fail to consider valid matches where the token pattern did not fit. Consider a simple regex like: .?b

If we have the string 'b', the .? part will fit, but then the 'b' in the pattern will not fit, leaving us with no match.

The same bug left us with too few matches in some cases. For instance, consider: .?.?

If we have a string of length two, like 'ab', we actually have three possible matches here: [a, b, ab]. We were only recovering 'ab'. This should now be fixed.

Note that the fix also uncovered another bug, where we weren't deduplicating the matches. There are actually two ways we might match 'a' and two ways we might match 'b': as the second token of the pattern, or as the first token of the pattern. This ambiguity is spurious, so we need to deduplicate.

Closes #2464 and #3009
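To make the .?b and .?.? cases concrete, here is a minimal pure-Python sketch of the idea. It is not spaCy's Matcher implementation; find_spans, any_char and is_b are hypothetical names invented for this illustration, and the "tokens" are single characters. For every optional pattern token it explores both branches (skip it, or consume one character), which is why the same span can be reached more than once and has to be deduplicated.

```python
from typing import Callable, List, Tuple

# A toy pattern token: (predicate on one character, is_optional).
# Hypothetical stand-in for illustration only, not spaCy's internals.
Token = Tuple[Callable[[str], bool], bool]


def find_spans(pattern: List[Token], text: str) -> List[Tuple[int, int]]:
    """Return every non-empty (start, end) span reachable by any path
    through the pattern, duplicates included."""
    spans: List[Tuple[int, int]] = []

    def walk(p: int, start: int, pos: int) -> None:
        if p == len(pattern):
            if pos > start:  # ignore empty matches
                spans.append((start, pos))
            return
        pred, optional = pattern[p]
        if optional:
            # Branch 1: the optional token matches zero characters.
            walk(p + 1, start, pos)
        if pos < len(text) and pred(text[pos]):
            # Branch 2: the token consumes exactly one character.
            walk(p + 1, start, pos + 1)

    for start in range(len(text)):
        walk(0, start, start)
    return spans


def any_char(ch: str) -> bool:
    return True


def is_b(ch: str) -> bool:
    return ch == "b"


# .?b against 'b': skipping the optional token lets 'b' match.
single = find_spans([(any_char, True), (is_b, False)], "b")

# .?.? against 'ab': 'a' and 'b' are each reachable by two paths.
raw = find_spans([(any_char, True), (any_char, True)], "ab")
unique = sorted(set(raw))  # deduplicate the spurious ambiguity
texts = ["ab"[s:e] for s, e in unique]

print(single)  # [(0, 1)]
print(raw)     # [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2)]
print(texts)   # ['a', 'ab', 'b']
```

In this toy model, always taking the "consume" branch for the optional token reproduces the reported symptom (no match for .?b on 'b', and only 'ab' for .?.?), while exploring both branches yields all three spans plus the duplicates that then need removing.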

* Add failing test for matcher bug #3009
* Deduplicate matches from Matcher
* Update matcher ? quantifier test
* Fix bug with ? quantifier in Matcher
* Fix Python2

honnibal · 2018-12-29T15:19:19Z

Fixed! 🎉

lock · 2019-01-28T16:05:31Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug Bugs and behaviour differing from documentation label Dec 6, 2018

ines added feat / matcher Feature: Token, phrase and dependency matcher 🌙 nightly Discussion and contributions related to nightly builds labels Dec 6, 2018

honnibal added a commit that referenced this issue Dec 29, 2018

Add failing test for matcher bug #3009

4d0e295

honnibal mentioned this issue Dec 29, 2018

Fix behaviour of Matcher's ? quantifier for v2.1 #3105

Merged

honnibal closed this as completed Dec 29, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 28, 2019
