Issue with several optional rule in Token Matcher #3951

alteest · 2019-07-11T11:49:30Z

In my Token Matcher I use OP '?'. If I use it once in a rule it works, but if I use it several times it doesn't.
With a pattern like: pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]

I expect it to match only "Hello world", "Hello my world", "Hello my any_word_here world" or "Hello any_word_here world".
But it also matches the phrase "Hello this small world". This is a problem, because I want at most one arbitrary token between "hello" and "world", with or without the token "my".

However, if I use, for example, the rule: pattern = [{"LOWER": "hello"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]
it works well with a phrase like "Hello world".

And the rule: pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}] also properly matches both phrases, "Hello world" and "Hello my world".

So I only see the issue when several {"OP": "?"} rules are used one after another.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'},  {"LOWER": "world"}]
#pattern = [{"LOWER": "hello"}, {'LENGTH': {'>': 1}, 'OP': '?'},  {"LOWER": "world"}]
#pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)


for text in ("Hello world", "Hello my world", "Hello big world", "Hello this small world"):
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(text, match_id, string_id, start, end, span.text)

spaCy version: 2.1.3
Platform: Linux-4.15.0-1037-azure-x86_64-with-debian-stretch-sid
Python version: 3.6.0
Models: de, en, fr


BreakBB · 2019-07-12T07:50:33Z

If I understand you correctly you want to match "Hello" and "world" and strings where any token is between those words.
If that is the case you can use wildcards and simply make them optional.

This is what I get with spaCy 2.1.4:

pattern = [{"LOWER": "hello"}, {"OP": "?"},  {"LOWER": "world"}]

# Hello world 15578876784678163569 HelloWorld 0 2 Hello world
# Hello my world 15578876784678163569 HelloWorld 0 3 Hello my world
# Hello big world 15578876784678163569 HelloWorld 0 3 Hello big world

alteest · 2019-07-12T08:14:34Z

No.
I want to match:
Hello world
Hello world
Hello my world
(word 'my' can be optional, also any one (and only one) word can be optional)

And not match, for example
Hello world

I tried to use this rule : pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]
But it doesn't work.

BreakBB · 2019-07-12T08:34:29Z

This might be a formatting issue, but I still don't get what strings you want to match and which not, excuse my misunderstanding.

Cause I don't see any difference in:

I want to match:
Hello world
Hello world

and

And not match, for example
Hello world

pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]

will match any combination of "hello" and "world" where "my" can follow "hello" and any token that has more than 1 character can come before "world". So the matched string can have at most 4 tokens and at least 2 (with "hello world"). These strings will match:

hello world
hello my world
hello my big world
hello big world

but these will not match:

hello my really big world
hello I world
hello my I world
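
The expected behaviour described above can be written down as a small reference check. This is only a sketch in plain Python, not spaCy's matcher: `token_matches` and `full_match` are hypothetical helpers, tokens are plain whitespace-split strings, and only `LOWER`, `LENGTH` with `>`, and `"OP": "?"` are supported.

```python
def token_matches(spec, token):
    """Check one token against a simplified pattern dict (LOWER and LENGTH only)."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "LENGTH" in spec and not len(token) > spec["LENGTH"][">"]:
        return False
    return True


def full_match(pattern, tokens):
    """Return True if the whole token list matches the pattern.

    An entry with "OP": "?" may consume zero or one token.
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    # Option 1: the current entry consumes exactly one token
    if tokens and token_matches(head, tokens[0]) and full_match(rest, tokens[1:]):
        return True
    # Option 2: an optional entry is skipped without consuming anything
    return head.get("OP") == "?" and full_match(rest, tokens)


pattern = [
    {"LOWER": "hello"},
    {"LOWER": "my", "OP": "?"},
    {"LENGTH": {">": 1}, "OP": "?"},
    {"LOWER": "world"},
]

for text in ("hello world", "hello my big world", "hello this small world"):
    print(text, full_match(pattern, text.split()))
```

Under these semantics "hello this small world" is rejected, because "this" can only be consumed by the `LENGTH` entry and nothing is left to consume "small".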

alteest · 2019-07-12T08:51:52Z

Sorry, some of my symbols were hidden (I think due to HTML formatting).

I want to match:
Hello world
Hello any_word_here world
Hello my any_word_here world
(word 'my' can be optional, also any one (and only one) word can be optional)

And not match, for example
Hello not_word_my any_word_here world

For example match:
Hello my big world
Hello beautiful world

But not match

Hello this big world
Hello such small world

BreakBB · 2019-07-12T09:13:58Z

Okay, now I get what you want to achieve and I even see your issues. A simple solution for you is to not try to put all the rules in one pattern but simply add multiple.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {}, {"LOWER": "world"}])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {}, {"LOWER": "world"}])

for text in ("Hello world", "Hello my world", "Hello my big world", "Hello big world", "Hello this small world"):
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(text, match_id, string_id, start, end, span.text)

# Matches are:
# Hello world 15578876784678163569 HelloWorld 0 2 Hello world
# Hello my world 15578876784678163569 HelloWorld 0 3 Hello my world
# Hello my big world 15578876784678163569 HelloWorld 0 4 Hello my big world
# Hello big world 15578876784678163569 HelloWorld 0 3 Hello big world

So only "Hello this small world" will not be matched.

alteest · 2019-07-12T12:36:45Z

Well, actually that's not acceptable. What should I do if I have several optional words in a rule?
Something like [{'LOWER': 'word1', 'OP': '?'}, {'LOWER': 'word2', 'OP': '?'}, {'LOWER': 'word2', 'OP': '?'}, etc.], repeated several times. For example 5 or even 10.
In that case I would have to create all possible combinations. A huge number of rules!
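
As a stopgap, the combinations do not have to be written by hand. A minimal sketch, assuming a hypothetical helper `expand_optional` (not part of spaCy) that turns every `"OP": "?"` entry into a "present" and an "absent" variant:

```python
from itertools import product


def expand_optional(pattern):
    """Expand each {"OP": "?"} entry into a variant with and without that token."""
    choices = []
    for spec in pattern:
        if spec.get("OP") == "?":
            required = {k: v for k, v in spec.items() if k != "OP"}
            choices.append([[required], []])  # keep the token, or drop it
        else:
            choices.append([[spec]])
    # One concrete pattern per combination of keep/drop decisions
    return [[spec for part in combo for spec in part] for combo in product(*choices)]


pattern = [
    {"LOWER": "hello"},
    {"LOWER": "my", "OP": "?"},
    {"LENGTH": {">": 1}, "OP": "?"},
    {"LOWER": "world"},
]
variants = expand_optional(pattern)
for variant in variants:
    print(variant)
```

Each variant could then be registered under the same key with `matcher.add("HelloWorld", None, variant)`, as in the multi-pattern example earlier in the thread. The number of variants still doubles with every optional token (2**n), so this only hides the blow-up rather than removing it.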

BreakBB · 2019-07-12T13:48:04Z

Okay, so I did some more testing and I think there is a bug here.

If you know which words can occur and want them to be explicit, you can add them one after another without problems:

[{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "new", "OP": "?"}, {"LOWER": "big", "OP": "?"}, {"LOWER": "world"}]

This way "my", "new" and "big" can appear between "hello" and "world". It matches strings where any of those words are present, and only those words.

But the problem/bug you're facing is that you want any token, like a wildcard, and you want it to be optional. I was misled by my previous testing, got confused about what you were expecting, and overlooked that your {"LENGTH": {">": 1}, "OP": "?"} rule matches too many tokens. So this really seems like a bug to me, especially since simple optional wildcard rules work as expected.

BreakBB · 2019-07-15T06:34:39Z

Just to simply reproduce this bug:

import spacy
from spacy.matcher import Matcher  # needed for Matcher below

nlp = spacy.load("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "hello"}, {"LOWER": "this", "OP": "?"}, {"OP": "?"}, {"LOWER": "world"}]
matcher.add("Test", None, pattern)
doc = nlp("Hello my new world")
assert len(matcher(doc)) == 0  # Fails, because there is a match

The matcher shouldn't match anything here because the second token "my" should not be matched on the {"LOWER": "this", "OP": "?"} rule.

zrlaida · 2019-08-12T15:05:25Z

Is it possible to know the status on this bug? Is it planned to be fixed in the next release?

bdewilde · 2019-08-15T14:08:08Z

I think I'm having a similar issue? I have a pattern with two optional tokens followed by one or more required tokens. Instead of getting just the longest continuous matches, I get every possible match. Here's a quick code example:

>>> import spacy
>>> en = spacy.load("en")
>>> matcher = spacy.matcher.Matcher(en.vocab)
>>> pattern = [{'POS': 'DET', 'OP': '?'}, {'POS': 'ADJ', 'OP': '?'}, {'POS': 'NOUN', 'OP': '+'}]
>>> matcher.add("match", None, pattern)
>>> doc = en("The natural language processing pipeline was confusing the poor developer.")
>>> [doc[start : end] for _, start, end in matcher(doc)]
[The natural language,
 natural language,
 language,
 The natural language processing,
 natural language processing,
 language processing,
 processing,
 The natural language processing pipeline,
 natural language processing pipeline,
 language processing pipeline,
 processing pipeline,
 pipeline,
 the poor developer,
 poor developer,
 developer]

spaCy version: 2.1.8
Platform: Darwin-18.7.0-x86_64-i386-64bit
Python version: 3.7.4
Models: es, en
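
If only the longest continuous matches are wanted, the overlapping results can be filtered after matching. A minimal sketch in plain Python, assuming a hypothetical helper `keep_longest` that works on `(match_id, start, end)` tuples like the ones the matcher returns; the token offsets below are derived from the output above and the "match" id is a placeholder:

```python
def keep_longest(matches):
    """Keep the longest matches, dropping any that overlap an already kept one."""
    # Longest first; ties are broken by the earliest start
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, seen = [], set()
    for match_id, start, end in ordered:
        if seen.isdisjoint(range(start, end)):
            kept.append((match_id, start, end))
            seen.update(range(start, end))
    # Return the survivors in document order
    return sorted(kept, key=lambda m: m[1])


# (start, end) offsets corresponding to the 15 spans listed above
spans = [(0, 3), (1, 3), (2, 3), (0, 4), (1, 4), (2, 4), (3, 4),
         (0, 5), (1, 5), (2, 5), (3, 5), (4, 5), (7, 10), (8, 10), (9, 10)]
matches = [("match", start, end) for start, end in spans]
longest = keep_longest(matches)
print(longest)  # [('match', 0, 5), ('match', 7, 10)]
```

This leaves "The natural language processing pipeline" and "the poor developer". It is a post-processing step only and does not change what the matcher itself returns.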

ines · 2019-08-20T14:42:52Z

Merging this with #4154!

lock · 2019-09-19T15:42:49Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "feat / matcher" (Feature: Token, phrase and dependency matcher) and "usage" (General spaCy usage) labels Jul 12, 2019

honnibal added the "bug" (Bugs and behaviour differing from documentation) label and removed the "usage" (General spaCy usage) label Jul 16, 2019

ines added a commit that referenced this issue Jul 16, 2019

Add regression test for #3951

62ff128

ines mentioned this issue Aug 15, 2019

Matches without a final {OP: ?} token are not returned #4120

Closed

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019

Add regression test for explosion#3951

3748c96

ines mentioned this issue Aug 20, 2019

💫 Matcher issues with 'OP': '?' #4154

Closed

ines closed this as completed Aug 20, 2019

svlandeg mentioned this issue Aug 21, 2019

fix retry loop in matcher #4162

Merged


lock bot locked as resolved and limited conversation to collaborators Sep 19, 2019
