Issue with several optional rule in Token Matcher #3951

Closed
alteest opened this issue Jul 11, 2019 · 12 comments
Labels
bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher

Comments

@alteest

alteest commented Jul 11, 2019

In my token Matcher I use the OP '?'. If I use it once in a rule it works, but if I use it several times it doesn't.
With a pattern like: pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]

I expect to match only: "Hello world", "Hello my world" or "Hello my world" (or "Hello world")
But it also matches the phrase "Hello this small world". This is an issue because I want at most one arbitrary token between "hello" and "world", with or without the token "my".

But, for example, if I use the rule: pattern = [{"LOWER": "hello"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]
it works well with a phrase like: "Hello world"

And the rule: pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}] also properly matches both phrases: "Hello world" and "Hello my world"

So I only see the issue when several {"OP": "?"} rules are used one after another.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'},  {"LOWER": "world"}]
#pattern = [{"LOWER": "hello"}, {'LENGTH': {'>': 1}, 'OP': '?'},  {"LOWER": "world"}]
#pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", None, pattern)


for text in ("Hello world", "Hello my world", "Hello big world", "Hello this small world"):
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(text, match_id, string_id, start, end, span.text)
  • spaCy version: 2.1.3
  • Platform: Linux-4.15.0-1037-azure-x86_64-with-debian-stretch-sid
  • Python version: 3.6.0
  • Models: de, en, fr
@BreakBB
Contributor

BreakBB commented Jul 12, 2019

If I understand you correctly, you want to match "Hello" and "world", as well as strings where any single token sits between those words.
If that is the case, you can use a wildcard token and simply make it optional.

This is what I get with spaCy 2.1.4:

pattern = [{"LOWER": "hello"}, {"OP": "?"},  {"LOWER": "world"}]

# Hello world 15578876784678163569 HelloWorld 0 2 Hello world
# Hello my world 15578876784678163569 HelloWorld 0 3 Hello my world
# Hello big world 15578876784678163569 HelloWorld 0 3 Hello big world

@alteest
Author

alteest commented Jul 12, 2019

No.
I want to match:
Hello world
Hello world
Hello my world
(word 'my' can be optional, also any one (and only one) word can be optional)

And not match, for example
Hello world

I tried to use this rule : pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]
But it doesn't work.

@BreakBB
Contributor

BreakBB commented Jul 12, 2019

This might be a formatting issue, but I still don't get which strings you want to match and which you don't. Excuse my misunderstanding.

Because I don't see any difference between:

I want to match:
Hello world
Hello world

and

And not match, for example
Hello world

pattern = [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {'LENGTH': {'>': 1}, 'OP': '?'}, {"LOWER": "world"}]

will match any combination of "hello" and "world" where "my" can follow "hello" and any token that has more than 1 character can come before "world". So the matched string can have at most 4 tokens and at least 2 (with "hello world"). These strings will match:

  • hello world
  • hello my world
  • hello my big world
  • hello big world

but these will not match:

  • hello my really big world
  • hello I world
  • hello my I world

@alteest
Author

alteest commented Jul 12, 2019

Sorry, some of my symbols were hidden (I think due to HTML formatting).

I want to match:
Hello world
Hello any_word_here world
Hello my any_word_here world
(word 'my' can be optional, also any one (and only one) word can be optional)

And not match, for example
Hello not_word_my any_word_here world

For example match:
Hello my big world
Hello beautiful world

But not match

Hello this big world
Hello such small world

@BreakBB
Contributor

BreakBB commented Jul 12, 2019

Okay, now I get what you want to achieve, and I can see your issue. A simple solution is not to put all the rules into one pattern but to add multiple patterns instead.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "world"}])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {}, {"LOWER": "world"}])
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {}, {"LOWER": "world"}])

for text in ("Hello world", "Hello my world", "Hello my big world", "Hello big world", "Hello this small world"):
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(text, match_id, string_id, start, end, span.text)

# Matches are:
# Hello world 15578876784678163569 HelloWorld 0 2 Hello world
# Hello my world 15578876784678163569 HelloWorld 0 3 Hello my world
# Hello my big world 15578876784678163569 HelloWorld 0 4 Hello my big world
# Hello big world 15578876784678163569 HelloWorld 0 3 Hello big world

So only "Hello this small world" will not be matched.

@ines ines added feat / matcher Feature: Token, phrase and dependency matcher usage General spaCy usage labels Jul 12, 2019
@alteest
Author

alteest commented Jul 12, 2019

Well, actually that's not acceptable. What should I do if I have several optional words in a rule?
Something like [{'LOWER': 'word1', 'OP': '?'}, {'LOWER': 'word2', 'OP': '?'}, {'LOWER': 'word2', 'OP': '?'}, etc] several times. For example 5 or even 10.
In that case I would have to create all possible combinations. A huge number of rules!
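
To put a number on that concern, here is a minimal sketch of the workaround taken to its conclusion: every `{"OP": "?"}` token is expanded into a "present" and an "absent" variant, so n optional tokens yield 2**n plain patterns. The helper name `expand_optional` is hypothetical and not part of spaCy; it works on plain pattern dicts only.

```python
from itertools import product


def expand_optional(pattern):
    """Expand each {"OP": "?"} token into 'absent' and 'present' variants.

    Hypothetical helper for illustration; it only understands the "?" operator.
    """
    choices = []
    for token in pattern:
        if token.get("OP") == "?":
            required = {k: v for k, v in token.items() if k != "OP"}
            choices.append([None, required])  # None means "token left out"
        else:
            choices.append([token])
    # Compare against None explicitly: a wildcard token {} is falsy but valid.
    return [[tok for tok in combo if tok is not None] for combo in product(*choices)]


pattern = [
    {"LOWER": "hello"},
    {"LOWER": "my", "OP": "?"},
    {"LENGTH": {">": 1}, "OP": "?"},
    {"LOWER": "world"},
]
variants = expand_optional(pattern)
for variant in variants:
    print(variant)
```

With the 2.x call style used in this thread, the result could presumably be registered in one go with `matcher.add("HelloWorld", None, *variants)`. Ten optional tokens would already produce 1024 patterns, which is the blow-up described above.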

@BreakBB
Contributor

BreakBB commented Jul 12, 2019

Okay, so I did some more testing and I think there is a bug here.

If you know which words can occur and want them to be explicit, you can add them one after another without any problems:

[{"LOWER": "hello"}, {"LOWER": "my", "OP": "?"}, {"LOWER": "new", "OP": "?"}, {"LOWER": "big", "OP": "?"}, {"LOWER": "world"}]

This way "my", "new" and "big" can appear between "hello" and "world". This matches strings where any of those words are present, and only those words.

But the problem/bug you're facing is that you want any token, like a wildcard, and you want it to be optional. I was misled by my previous testing, got confused about what you were expecting, and overlooked that your {"LENGTH": {">": 1}, "OP": "?"} rule matches too many tokens. So this really seems like a bug to me, especially since simple optional wildcard rules work as expected.

@BreakBB
Contributor

BreakBB commented Jul 15, 2019

Just to simply reproduce this bug:

import spacy
from spacy.matcher import Matcher  # needed for Matcher(...) below

nlp = spacy.load("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "hello"}, {"LOWER": "this", "OP": "?"}, {"OP": "?"}, {"LOWER": "world"}]
matcher.add("Test", None, pattern)
doc = nlp("Hello my new world")
assert len(matcher(doc)) == 0  # Fails, because there is a match

The matcher shouldn't match anything here because the second token "my" should not be matched on the {"LOWER": "this", "OP": "?"} rule.
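
For comparison, the expected semantics can be written down as a small pure-Python reference matcher. This is only a sketch under stated assumptions: it does not use spaCy, tokens come from a plain whitespace split, and only `LOWER`, `LENGTH` with `>`, the empty wildcard `{}` and the `?` operator are supported. The function names are hypothetical.

```python
def token_matches(spec, token):
    """Check one token string against a tiny subset of Matcher attributes."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "LENGTH" in spec:
        min_len = spec["LENGTH"].get(">")
        if min_len is not None and not len(token) > min_len:
            return False
    return True  # an empty spec (wildcard) matches any token


def match_ends(pattern, tokens, p, t):
    """Return all end offsets at which pattern[p:] can finish, starting at tokens[t]."""
    if p == len(pattern):
        return {t}
    spec = pattern[p]
    ends = set()
    if spec.get("OP") == "?":
        ends |= match_ends(pattern, tokens, p + 1, t)  # skip the optional token
    if t < len(tokens) and token_matches(spec, tokens[t]):
        ends |= match_ends(pattern, tokens, p + 1, t + 1)  # consume one token
    return ends


def find_matches(pattern, text):
    tokens = text.split()  # assumption: whitespace tokenization is good enough here
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_ends(pattern, tokens, 0, start)
    )


pattern = [{"LOWER": "hello"}, {"LOWER": "this", "OP": "?"}, {"OP": "?"}, {"LOWER": "world"}]
print(find_matches(pattern, "Hello my new world"))    # expected: no match
print(find_matches(pattern, "Hello this new world"))  # expected: one match
```

Under these semantics an optional token consumes either zero or one token, so "Hello my new world" cannot match: once "my" is taken by the wildcard, "new" has nothing left to match against.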

@honnibal honnibal added bug Bugs and behaviour differing from documentation and removed usage General spaCy usage labels Jul 16, 2019
ines added a commit that referenced this issue Jul 16, 2019
@zrlaida

zrlaida commented Aug 12, 2019

Is it possible to know the status on this bug? Is it planned to be fixed in the next release?

@bdewilde

bdewilde commented Aug 15, 2019

I think I'm having a similar issue? I have a pattern with two optional tokens followed by one or more required tokens. Instead of getting just the longest continuous matches, I get every possible match. Here's a quick code example:

>>> import spacy
>>> en = spacy.load("en")
>>> matcher = spacy.matcher.Matcher(en.vocab)
>>> pattern = [{'POS': 'DET', 'OP': '?'}, {'POS': 'ADJ', 'OP': '?'}, {'POS': 'NOUN', 'OP': '+'}]
>>> matcher.add("match", None, pattern)
>>> doc = en("The natural language processing pipeline was confusing the poor developer.")
>>> [doc[start : end] for _, start, end in matcher(doc)]
[The natural language,
 natural language,
 language,
 The natural language processing,
 natural language processing,
 language processing,
 processing,
 The natural language processing pipeline,
 natural language processing pipeline,
 language processing pipeline,
 processing pipeline,
 pipeline,
 the poor developer,
 poor developer,
 developer]
  • spaCy version: 2.1.8
  • Platform: Darwin-18.7.0-x86_64-i386-64bit
  • Python version: 3.7.4
  • Models: es, en
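
If only the longest non-overlapping matches are wanted, the match list can be filtered after the fact. The sketch below is pure Python and works on `(start, end)` token offsets; the offsets were reconstructed by hand from the output above, assuming the final period is its own token, and `keep_longest` is a hypothetical helper name.

```python
def keep_longest(matches):
    """Keep the longest non-overlapping (start, end) spans; ties go to the earlier span."""
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept, used = [], set()
    for start, end in ordered:
        if not used.intersection(range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)


words = ["The", "natural", "language", "processing", "pipeline",
         "was", "confusing", "the", "poor", "developer", "."]
# Offsets corresponding to the 15 spans listed above (reconstructed, see lead-in).
matches = [
    (0, 3), (1, 3), (2, 3),
    (0, 4), (1, 4), (2, 4), (3, 4),
    (0, 5), (1, 5), (2, 5), (3, 5), (4, 5),
    (7, 10), (8, 10), (9, 10),
]
longest = keep_longest(matches)
print([" ".join(words[start:end]) for start, end in longest])
# ['The natural language processing pipeline', 'the poor developer']
```

spaCy also provides `spacy.util.filter_spans`, which applies a similar longest-span-first rule to `Span` objects; whether it is available depends on the installed version, so check before relying on it.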

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
@ines
Member

ines commented Aug 20, 2019

Merging this with #4154!

@ines ines closed this as completed Aug 20, 2019
@lock

lock bot commented Sep 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 19, 2019
6 participants