Issue with several optional rule in Token Matcher #3951
Comments
If I understand you correctly, you want to match "Hello" and "world" and strings where any token is between those words. This is what I get with spaCy 2.1.4:
pattern = [{"LOWER": "hello"}, {"OP": "?"}, {"LOWER": "world"}]
# Hello world 15578876784678163569 HelloWorld 0 2 Hello world
# Hello my world 15578876784678163569 HelloWorld 0 3 Hello my world
# Hello big world 15578876784678163569 HelloWorld 0 3 Hello big world
No. And not match, for example. I tried to use this rule:
pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LENGTH": {">": 1}, "OP": "?"}, {"LOWER": "world"}]
This might be a formatting issue, but I still don't understand which strings you want to match and which you don't; excuse my misunderstanding. I don't see any difference in:
and
will match any combination of "hello" and "world" where "my" can follow "hello" and any token that has more than 1 character can come before "world". So the matched string can have at most 4 tokens and at least 2 (with "hello world"). These strings will match:
but these will not match:
Sorry, some of my symbols were hidden (I think due to HTML formatting). I want to match: And not match, for example For example match: But not match: Hello this big world
Okay, now I get what you want to achieve, and I can even see your issue. A simple solution is not to put all the rules in one pattern but to add several patterns under the same key:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# spaCy 2.x signature: add(key, on_match, *patterns)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {}, {"LOWER": "world"}])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {}, {"LOWER": "world"}])
for text in ("Hello world", "Hello my world", "Hello my big world", "Hello big world", "Hello this small world"):
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(text, match_id, string_id, start, end, span.text)
# Matches are:
# Hello world 15578876784678163569 HelloWorld 0 2 Hello world
# Hello my world 15578876784678163569 HelloWorld 0 3 Hello my world
# Hello my big world 15578876784678163569 HelloWorld 0 4 Hello my big world
# Hello big world 15578876784678163569 HelloWorld 0 3 Hello big world
So only "Hello this small world" will not be matched.
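The workaround of registering several patterns can be generated mechanically when a rule contains several optional tokens. The following is a minimal sketch, not part of spaCy: `expand_optional` is a hypothetical helper that replaces every `"OP": "?"` token with two explicit variants (token present, token absent), so none of the resulting patterns relies on the optional operator.

```python
from itertools import product


def expand_optional(pattern):
    """Expand every {"OP": "?"} token into explicit present/absent variants.

    Hypothetical helper for illustration; it only handles the "?" operator.
    """
    choices = []
    for token in pattern:
        if token.get("OP") == "?":
            # Same token spec without the operator; {} stays a wildcard.
            required = {k: v for k, v in token.items() if k != "OP"}
            choices.append((None, required))  # absent, or present exactly once
        else:
            choices.append((token,))
    variants = []
    for combo in product(*choices):
        variant = [tok for tok in combo if tok is not None]
        if variant not in variants:  # drop duplicate expansions
            variants.append(variant)
    return variants


pattern = [
    {"LOWER": "hello"},
    {"LOWER": "my", "OP": "?"},
    {"OP": "?"},
    {"LOWER": "world"},
]
for variant in expand_optional(pattern):
    print(variant)
```

Each variant could then be registered under one key with the spaCy 2.x call used in this thread, `matcher.add("HelloWorld", None, variant)`. Note that a text such as "Hello my world" satisfies more than one variant, so duplicate matches may need to be removed afterwards.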
Well, actually that's not acceptable. What should I do if I have several optional words in a rule?
Okay, so I did some more testing and I think there is a bug here. If you know which words can occur and want them to be explicit, you can add them one after another without problems:
[{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "new", "OP": "?"}, {"LOWER": "big", "OP": "?"}, {"LOWER": "world"}]
This way "my", "new" and "big" can appear between "hello" and "world". This matches strings where any of those words are present, and only those words. But the problem/bug you're facing is that you want any token, like a wildcard, and you want it to be optional. I was misled in my previous testing, got confused about what you were expecting, and overlooked that your
A simple way to reproduce this bug:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "this", "OP": "?"}, {"OP": "?"}, {"LOWER": "world"}]
matcher.add("Test", None, pattern)
doc = nlp("Hello my new world")
assert len(matcher(doc)) == 0  # Fails, because there is a match
The matcher shouldn't match anything here because the second token "my" should not be matched on the
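To make the expected behaviour concrete, here is a minimal pure-Python sketch of the semantics this thread assumes for `?`: each optional element consumes either zero tokens or exactly one. It is not spaCy's implementation. The names `matches_from`, `find_matches`, `lower` and `any_token` are hypothetical, and whitespace splitting stands in for real tokenization.

```python
def matches_from(tokens, pattern, i, j):
    """Return the end offsets at which pattern[j:] can finish, starting at tokens[i]."""
    if j == len(pattern):
        return {i}
    predicate, optional = pattern[j]
    ends = set()
    if optional:
        # Skip the optional element without consuming a token.
        ends |= matches_from(tokens, pattern, i, j + 1)
    if i < len(tokens) and predicate(tokens[i]):
        # Consume exactly one token with this element.
        ends |= matches_from(tokens, pattern, i + 1, j + 1)
    return ends


def find_matches(tokens, pattern):
    """Return all (start, end) spans matched by the pattern, in sorted order."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in matches_from(tokens, pattern, start, 0)
    )


def lower(word):
    """Predicate: the token equals `word`, ignoring case."""
    return lambda tok: tok.lower() == word


def any_token(tok):
    """Predicate: wildcard, accepts every token."""
    return True


# Same shape as the reproduction pattern: hello, this?, <any>?, world
pattern = [
    (lower("hello"), False),
    (lower("this"), True),
    (any_token, True),
    (lower("world"), False),
]

print(find_matches("Hello my new world".split(), pattern))      # []
print(find_matches("Hello this small world".split(), pattern))  # [(0, 4)]
```

Under these semantics "Hello my new world" yields no match, because the single wildcard can absorb "my" but nothing can absorb "new"; that is the result the assertion in the reproduction expects.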
Is it possible to know the status of this bug? Is a fix planned for the next release?
I think I'm having a similar issue? I have a pattern with two optional tokens followed by one or more required tokens. Instead of getting just the longest continuous matches, I get every possible match. Here's a quick code example:
>>> import spacy
>>> en = spacy.load("en")
>>> matcher = spacy.matcher.Matcher(en.vocab)
>>> pattern = [{'POS': 'DET', 'OP': '?'}, {'POS': 'ADJ', 'OP': '?'}, {'POS': 'NOUN', 'OP': '+'}]
>>> matcher.add("match", None, pattern)
>>> doc = en("The natural language processing pipeline was confusing the poor developer.")
>>> [doc[start : end] for _, start, end in matcher(doc)]
[The natural language,
natural language,
language,
The natural language processing,
natural language processing,
language processing,
processing,
The natural language processing pipeline,
natural language processing pipeline,
language processing pipeline,
processing pipeline,
pipeline,
the poor developer,
poor developer,
developer]
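If only the longest continuous matches are wanted, the returned spans can be post-filtered. The sketch below is one possible approach, not the Matcher's own behaviour: `filter_longest` is a hypothetical helper that works on plain `(start, end)` token offsets, and the offsets listed are assumed from the output above (with "The" as token 0). Recent spaCy versions also ship a similar utility, `spacy.util.filter_spans`.

```python
def filter_longest(spans):
    """Keep the longest spans and drop every span that overlaps a kept one."""
    kept = []
    taken = set()  # token indices already covered by a kept span
    # Longest first; ties are broken by the earlier start offset.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)


# Assumed (start, end) offsets for the fifteen matches printed above.
spans = [
    (0, 3), (1, 3), (2, 3),
    (0, 4), (1, 4), (2, 4), (3, 4),
    (0, 5), (1, 5), (2, 5), (3, 5), (4, 5),
    (7, 10), (8, 10), (9, 10),
]
print(filter_longest(spans))  # [(0, 5), (7, 10)]
```

The two remaining spans correspond to "The natural language processing pipeline" and "the poor developer".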
Merging this with #4154!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
In my Token Matcher I use OP '?'. If I use it once in a rule it works, but if I use it several times it doesn't.
With a pattern like: pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]
I expect to match only: "Hello world", "Hello my world" or "Hello my world" (or "Hello world")
But it also matches the phrase "Hello this small world". This is an issue because I want any token to appear only once between "hello" and "world", with or without the token "my".
But if, for example, I use the rule: pattern = [{"LOWER": "hello"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]
it works well with a phrase like: "Hello world"
And the rule: pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}] also properly matches both phrases: "Hello world" and "Hello my world"
So I only see the issue when several {"OP": "?"} rules are used one after another.