Matcher behavior with * quantifier #3009
Comments
Here is another case, which I think is related and shows a similar or identical problem: all three rules should match, but only rules 1 and 2 do.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add('1', None, *[[{'LEMMA': 'have'},
{'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])
matcher.add('2', None, *[[{'LEMMA': 'have'},
{'IS_ASCII': True, 'IS_PUNCT': False, 'OP': '*'},
{'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])
matcher.add('3', None, *[[{'LEMMA': 'have'},
{'IS_ASCII': True, 'IS_PUNCT': False, 'OP': '?'},
{'LOWER': 'to'}, {'LOWER': 'do'}, {'POS': 'ADP'}]])
doc = nlp(
"Some of it also has to do with rising US interest rates, a stronger dollar, and a firm economy that's supporting earnings growth."
)
matches = matcher(doc)
print('\n\n')
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]  # the matched span
    print(match_id, string_id, span.text)
Thanks, definitely seems like a bug.
The ? quantifier indicates a token may occur zero or one times. If the token pattern fit, the matcher would fail to consider valid matches where the token pattern did not fit. Consider a simple regex like:
.?b
If we have the string 'b', the .? part will fit, but then the 'b' in the pattern will not fit, leaving us with no match.
The same bug left us with too few matches in some cases. For instance, consider:
.?.?
If we have a string of length two, like 'ab', we actually have three possible matches here: [a, b, ab]. We were only recovering 'ab'. This should now be fixed.
Note that the fix also uncovered another bug, where we weren't deduplicating the matches. There are actually two ways we might match 'a' and two ways we might match 'b': as the second token of the pattern, or as the first token of the pattern. This ambiguity is spurious, so we need to deduplicate.
Closes #2464 and #3009
* Add failing test for matcher bug #3009
* Deduplicate matches from Matcher
* Update matcher ? quantifier test
* Fix bug with ? quantifier in Matcher
* Fix Python2
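To make the ? quantifier behaviour described in the commit message concrete, here is a minimal pure-Python sketch. It is not spaCy's implementation: the function `find_matches`, the `(predicate, op)` pattern format, and the helper predicates are hypothetical names invented for this illustration. It only shows the two ideas from the fix: an optional item must be tried both ways (skipped and consumed), and the resulting spans must be deduplicated.

```python
from typing import Callable, List, Sequence, Set, Tuple

Predicate = Callable[[str], bool]
# Hypothetical pattern format: (predicate, op), where op is
# "1" (exactly once) or "?" (zero or one times).
Pattern = Sequence[Tuple[Predicate, str]]


def find_matches(pattern: Pattern, tokens: Sequence[str]) -> List[Tuple[int, int]]:
    """Return every distinct non-empty (start, end) span the pattern can match."""
    found: Set[Tuple[int, int]] = set()  # a set drops duplicate derivations

    def walk(p: int, t: int, start: int) -> None:
        if p == len(pattern):
            if t > start:  # ignore empty spans
                found.add((start, t))
            return
        predicate, op = pattern[p]
        if op == "?":
            # Branch 1: skip the optional item, even if it would fit here.
            walk(p + 1, t, start)
        # Branch 2: consume one token if the item fits.
        if t < len(tokens) and predicate(tokens[t]):
            walk(p + 1, t + 1, start)

    for start in range(len(tokens)):
        walk(0, start, start)
    return sorted(found)


def any_token(tok: str) -> bool:
    return True


def is_b(tok: str) -> bool:
    return tok == "b"


# .?b against "b": the match only succeeds if the optional item is skipped.
print(find_matches([(any_token, "?"), (is_b, "1")], ["b"]))
# .?.? against "ab": three distinct spans (a, ab, b), each reported once.
print(find_matches([(any_token, "?"), (any_token, "?")], ["a", "b"]))
```

Without the skip branch, the first call finds nothing; without the set, the second call would report the single-token spans twice, once for each pattern item that could have consumed the token.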
Fixed! 🎉
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
When I run the following code I expect no match, but I get "have probably done things we look", which probably comes from the first rule.
If I move rule 1 to the end, I see no match, which is the expected behaviour.
The problem is similar to this one: #2005
According to that issue I shouldn't see any match with version 2.1.0, but the same problem occurs there as well as with version 2.0.18.
I tried to use matcher2 as described here: #1971 (comment)
But I got: ModuleNotFoundError: No module named 'spacy.matcher2'
Did I misinterpret something, or is this another case of the * quantifier that needs to be handled?
Info about spaCy