Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063

johann-petrak · 2019-08-01T11:40:44Z

This is one of the most puzzling and not very well documented details about Spacy:

the Matcher apparently only allows to match a subset of token attributes and requires that instead of the original names of those attributes, special all-uppercase identifiers have to be used. Why? Why not allow to use all Token attributes and by their original name?
The entity ruler examples seem to indicate that lower case, original name attributes can be used. But the documentation does not really say much about this or if other features of the matcher like value ranges, operators and quantifiers can be used.

Ideally the token matching in the entity ruler would work identically to the matcher and both would allow to use all Token attributes with their original name.

BreakBB · 2019-08-01T12:03:13Z

I already had a look at the code about this and the EntityRuler internally creates a Matcher (see here) and possibly a PhraseMatcher (see here) as well.

Moreover the add_patterns method of the EntityRuler delegates the pattern to those matchers (see here). So from my understanding the pattern format should be identical and the examples and docs should be updated.

ines · 2019-08-01T12:05:08Z

Sorry if this was confusing. The EntityRuler doesn't do anything special – its token-based patterns are passed directly to the Matcher. So the attributes they support are the same.

The names being uppercase was more of a convention that was introduced early on to make them resemble the symbol IDs and make them stand out a bit more as "special" in Python. But spaCy normalises them internally, so if you try it, you should see that capitalisation doesn't matter.

To make this more clear in the docs, we could add the following:

A note here mentioning that "token patterns" == Matcher patterns and "phrase patterns" == PhraseMatcher patterns.
A note that capitalisation doesn't matter.
We might as well make the attributes in the entity ruler patterns use uppercase attributes for consistency.

johann-petrak · 2019-08-01T13:42:59Z

Thank you!

Maybe I was also confused by doing things wrong when I tried this out.

One thing I noticed is this: is I initialise the Matcher with validate=True (oddly that parameter does not show up in the docstring for Matcher?) then using an attribute name "whitespace_" cuases an error:

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
d1 = nlp("This is some text. It has two sentences.")

m1 = Matcher(nlp.vocab, validate=True)
pattern = [{"whitespace_": " "},{"whitespace_": ""}]
m1.add("p1", None, pattern)

This shows messages:

spacy.errors.MatchPatternError: Invalid token patterns for matcher rule 'p1'

Pattern 0:
- Additional properties are not allowed ('whitespace_' was unexpected) [0]
- Additional properties are not allowed ('whitespace_' was unexpected) [1]

Originally I thought this indicates I cannot use the attribute name because the error is not shown only for the upper case versions, but apparently I can.

However:
If I try adding the pattern without validation, it gets accepted and then produces matches. However, I think those matches are wrong:

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
d1 = nlp("This is some text. It has two sentences.")

m1 = Matcher(nlp.vocab, validate=False)
pattern = [{"whitespace_": " "},{"whitespace_": ""}]
m1.add("p1", None, pattern)

print([(t.text, t.whitespace_) for t in d1])

matches = m1(d1)
for match in matches:
    print(d1[match[1]:match[2]])

As you can see all the consecutive token pairs are matched, but it should really just match those where the second token has whitespace set to the empty string which only occurs twice.

When I try some other rules using either "text" or "ORTH", things indeed do work as expected.

johann-petrak · 2019-08-01T14:00:27Z

OK a bit more experimentation shows that if I disable validation, any name is accepted, but when I create a rule that contains non-existing attribute names, that rule always matches!

So it looks as if whitespace_ is not actualy an attribute name that is recognized by the Matcher and this in turn explains why the rule matches everywhere.

I find it odd that an unknown attribute matches everywhere for an equals condition since logically this should probably be equivalent to matching None, but everything (except None) equal-compares to None as False, not True!

johann-petrak · 2019-08-01T14:33:02Z

Initially I relied on the informaiton on this page: https://spacy.io/usage/rule-based-matching

This has a table of the allowed attribute names (uppercase). I initially thought this is a limited list, but it actually says the table shows only the "most relevant" ones. But when trying to use the uppercased name for something not in this list, e.g. "WHITESPACE_" then withv "validate=True" the matcher will complain that the name "WHITESPACE_" is unexpected and with "validate=False" an incorrect result is produced.

Also, the table says that the extension attributes are supported, but not how: trying something like {"_.myattribute": someval} again gets refused with validate=True and produces wrong results with validate=False.

ines · 2019-08-01T17:30:41Z

I think the main confusion you're running into here is that you're trying to include the underscore, which is not needed. If you look at the examples, you'll see that the underscore vs. non-underscore distinction doesn't matter here, because the patterns always specify the explicit string values, not the IDs (because that'd be pretty inconvenient). So your patterns will specify things like "POS": "NOUN" and not "POS_": "NOUN".

I need to check this but WHITESPACE might be one of the attributes that's currently not supported, because it's one of the special attributes that doesn't have an ID-equivalent and didn't seem that useful for matching. But I guess we could add it for consistency.

{"_.myattribute": someval} again gets refused with validate=True and produces wrong results with validate=False.

The table in the docs shows the top-level properties. So for underscore, that property is _. So your match pattern could look like this: {"_": {"myattribute": "value"}}.

I find it odd that an unknown attribute matches everywhere

The problem here is that the matcher will ignore unknown properties because it cannot match those token attributes. So {"XYZ": "something"} will be interpreted as {}. To avoid that, spaCy would have to always validate, which is inefficient. But maybe we can find some way to make it raise an error here that's less expensive.

johann-petrak · 2019-08-01T18:14:03Z

Thanks that clears up many questions! I should probably have noticed that the trailing underscore is not just not necessary but harmful here, but maybe it would be a good idea to include this info in the docs for people like me?

I generally tend to expect such things to work according to the idea of 'least surprise' so if the rule can contain an attribute, why not simply use the original name, which is what most people would just quietly expect? Personally would I think that consistency would be more helpful here than the convenience of saving the underscore but if that is important I would at least also accept the original name because not doing it is really surprising.

When I use the validate=True parameter, neither 'pos' nor 'whitespace' without an underscore is accepted, both produce an error. I did not check yet if 'pos' works despite getting rejected, but something is weird here.

If I change the code to use validate=False and 'whitespace' without the underscore, the result is still wrong so it looks as if this attribute is not supported. I think it should be, as simply all attributes should be supported (to cause least surprise).

Thanks for pointing out how to use the '_' attribute correctly, this allows me to create a workaround where I first copy the whitespace attribute and then use it from there.

honnibal · 2019-08-05T22:21:44Z

There's an implementation detail that's shining through here, and I guess you're right that it should be clearer.

We can't provide access to all of the attributes because the Matcher loops over the Cython data, not the Python objects. Inside the matcher we're dealing with a spacy.structs.TokenC struct -- we don't have an instance of spacy.tokens.token.Token. This means that all of the attributes that refer to computed properties can't be accessed.

These upper-case names refer to symbols from the spacy.attrs enum table. They're passed into a function that essentially is a big case/switch statement, to figure out which struct field to return. The same attribute identifiers are used in doc.to_array(), and a few other places in the code where you need to describe fields like this.

We do actually let you write the fields case insensitive, so if you write "lower" or "lemma" it will work. We do write the examples in upper-case though, because we want to be clear that the set of possibilities is different.

Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses explosion#4070 (also related: explosion#4063, explosion#4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

…[ci skip]

* Fix typo in rule-based matching docs * Improve token pattern checking without validation Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100). * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema * Check whether attribute value types are supported in general (as opposed to per attribute with full validation) * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc() * Support attr=TEXT for PhraseMatcher * Add NORM to schema * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler * Remove unnecessary .keys() * Rephrase error messages * Add another type check to Matcher Add another type check to Matcher for more understandable error messages in some rare cases. * Support phrase_matcher_attr=TEXT for EntityRuler * Don't use spacy.errors in examples and bin scripts * Fix error code * Auto-format Also try get Azure pipelines to finally start a build :( * Update errors.py Co-authored-by: Ines Montani <[email protected]> Co-authored-by: Matthew Honnibal <[email protected]>

lock · 2019-09-05T10:42:45Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the docs Documentation and website label Aug 1, 2019

ines added the feat / matcher Feature: Token, phrase and dependency matcher label Aug 1, 2019

This was referenced Aug 2, 2019

Add pattern validation to EntityRuler #4069

Closed

Improve handling of unsupported Matcher attributes #4070

Closed

ines closed this as completed in 223bde5 Aug 6, 2019

ines added a commit that referenced this issue Aug 6, 2019

Use consistent casing for entity ruler patterns (see #4063) [ci skip]

4ae320e

adrianeboyd mentioned this issue Aug 12, 2019

Improve token pattern checking without validation #4105

Merged

3 tasks

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019

Improve docs on matcher attributes [ci skip] (closes explosion#4063)

427d18b

polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019

Use consistent casing for entity ruler patterns (see explosion#4063) …

d63cc30

…[ci skip]

lock bot locked as resolved and limited conversation to collaborators Sep 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063

Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063

johann-petrak commented Aug 1, 2019

BreakBB commented Aug 1, 2019

ines commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

ines commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

honnibal commented Aug 5, 2019 •

edited by ines

Loading

lock bot commented Sep 5, 2019

Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063

Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063

Comments

johann-petrak commented Aug 1, 2019

BreakBB commented Aug 1, 2019

ines commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

ines commented Aug 1, 2019

johann-petrak commented Aug 1, 2019

honnibal commented Aug 5, 2019 • edited by ines Loading

lock bot commented Sep 5, 2019

honnibal commented Aug 5, 2019 •

edited by ines

Loading