Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matcher behavior with * quantifier #3009

Closed
mehmetilker opened this issue Dec 4, 2018 · 4 comments
Closed

Matcher behavior with * quantifier #3009

mehmetilker opened this issue Dec 4, 2018 · 4 comments
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher 🌙 nightly Discussion and contributions related to nightly builds

Comments

@mehmetilker
Copy link

How to reproduce the behaviour

When I run following code I expect no match but I get "have probably done things we look" which is probably match from first rule.
If I move 1. rule to the last I see no mach which is expected behaviour.

Problem is similar to this one: #2005
According to that issue I shouldn't see any match with version 2.1.0 but same problem there as well as with version 2.0.18

I have tried to implement matcher2 as stated here :#1971 (comment)
But I got: ModuleNotFoundError: No module named 'spacy.matcher2'

Did I interpret something wrong or it is another case for * quantifier needs to be handled?

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

matcher.add('1', None, *[[{'LEMMA': 'have'}, {'TAG': 'DT', 'OP': '?'}, {'TAG': 'PRP$', 'OP': '?'}, {'LOWER': 'look'}]])

matcher.add('2', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'in'}, {'LOWER': 'mind'}]])
matcher.add('3', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'it'}, {'LOWER': 'away'}]])
matcher.add('4', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'it'}, {'LOWER': 'coming'}]])
matcher.add('5', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'it'}, {'LOWER': 'off'}]])
matcher.add('6', None, *[[{'LEMMA': 'have'}, {'IS_ASCII': True,
                                              'IS_PUNCT': False, 'OP': '*'}, {'LEMMA': 'the'}, {'LOWER': 'best'}]])


doc = nlp(
    "And people generally in high school, I think all of us have probably done things we look back on in high school and regret or cringe a bit."
)
matches = matcher(doc)
print('\n\n')
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]  # the matched span
    print(match_id, string_id, start, end, span.text)

Info about spaCy

  • spaCy version: 2.1.0a3
  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.6.5
  • Models: en_core_web_sm
@mehmetilker
Copy link
Author

Another case here (I think related) and similar or exact same problem with this one:
#2464

All 3 rules should match but only 1. and 2. rules are working.
If I change 'IS_ASCII': True to False 3. rule matches as well.
Same result with 2.0.18 and 2.1.0.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

matcher.add('1', None, *[[{'LEMMA': 'have'},
                          {'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])

matcher.add('2', None, *[[{'LEMMA': 'have'},
                          {'IS_ASCII': True, 'IS_PUNCT': False, 'OP': '*'},
                          {'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])

matcher.add('3', None, *[[{'LEMMA': 'have'},
                          {'IS_ASCII': True, 'IS_PUNCT': False, 'OP': '?'},
                          {'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])

doc = nlp(
    "Some of it also has to do with rising US interest rates, a stronger dollar, and a firm economy that's supporting earnings growth."
)
matches = matcher(doc)
print('\n\n')
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]  # the matched span
    print(match_id, string_id, span.text)

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Dec 6, 2018
@honnibal
Copy link
Member

honnibal commented Dec 6, 2018

Thanks, definitely seems like a bug.

@ines ines added feat / matcher Feature: Token, phrase and dependency matcher 🌙 nightly Discussion and contributions related to nightly builds labels Dec 6, 2018
honnibal added a commit that referenced this issue Dec 29, 2018
honnibal added a commit that referenced this issue Dec 29, 2018
The ? quantifier indicates a token may occur zero or one times. If the
token pattern fit, the matcher would fail to consider valid matches
where the token pattern did not fit. Consider a simple regex like:

.?b

If we have the string 'b', the .? part will fit --- but then the 'b' in
the pattern will not fit, leaving us with no match. The same bug left us
with too few matches in some cases. For instance, consider:

.?.?

If we have a string of length two, like 'ab', we actually have three
possible matches here: [a, b, ab]. We were only recovering 'ab'. This
should now be fixed. Note that the fix also uncovered another bug, where
we weren't deduplicating the matches. There are actually two ways we
might match 'a' and two ways we might match 'b': as the second token of the pattern,
or as the first token of the pattern. This ambiguity is spurious, so we
need to deduplicate.

Closes #2464 and #3009
honnibal added a commit that referenced this issue Dec 29, 2018
* Add failing test for matcher bug #3009

* Deduplicate matches from Matcher

* Update matcher ? quantifier test

* Fix bug with ? quantifier in Matcher

The ? quantifier indicates a token may occur zero or one times. If the
token pattern fit, the matcher would fail to consider valid matches
where the token pattern did not fit. Consider a simple regex like:

.?b

If we have the string 'b', the .? part will fit --- but then the 'b' in
the pattern will not fit, leaving us with no match. The same bug left us
with too few matches in some cases. For instance, consider:

.?.?

If we have a string of length two, like 'ab', we actually have three
possible matches here: [a, b, ab]. We were only recovering 'ab'. This
should now be fixed. Note that the fix also uncovered another bug, where
we weren't deduplicating the matches. There are actually two ways we
might match 'a' and two ways we might match 'b': as the second token of the pattern,
or as the first token of the pattern. This ambiguity is spurious, so we
need to deduplicate.

Closes #2464 and #3009

* Fix Python2
@honnibal
Copy link
Member

Fixed! 🎉

@lock
Copy link

lock bot commented Jan 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher 🌙 nightly Discussion and contributions related to nightly builds
Projects
None yet
Development

No branches or pull requests

3 participants