Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063

Closed
johann-petrak opened this issue Aug 1, 2019 · 9 comments
Labels
docs Documentation and website feat / matcher Feature: Token, phrase and dependency matcher

Comments

@johann-petrak
Copy link

This is one of the most puzzling and not very well documented details about Spacy:

  • the Matcher apparently only allows to match a subset of token attributes and requires that instead of the original names of those attributes, special all-uppercase identifiers have to be used. Why? Why not allow to use all Token attributes and by their original name?
  • The entity ruler examples seem to indicate that lower case, original name attributes can be used. But the documentation does not really say much about this or if other features of the matcher like value ranges, operators and quantifiers can be used.

Ideally the token matching in the entity ruler would work identically to the matcher and both would allow to use all Token attributes with their original name.

@ines ines added the docs Documentation and website label Aug 1, 2019
@BreakBB
Copy link
Contributor

BreakBB commented Aug 1, 2019

I already had a look at the code about this and the EntityRuler internally creates a Matcher (see here) and possibly a PhraseMatcher (see here) as well.

Moreover the add_patterns method of the EntityRuler delegates the pattern to those matchers (see here). So from my understanding the pattern format should be identical and the examples and docs should be updated.

@ines
Copy link
Member

ines commented Aug 1, 2019

Sorry if this was confusing. The EntityRuler doesn't do anything special – its token-based patterns are passed directly to the Matcher. So the attributes they support are the same.

The names being uppercase was more of a convention that was introduced early on to make them resemble the symbol IDs and make them stand out a bit more as "special" in Python. But spaCy normalises them internally, so if you try it, you should see that capitalisation doesn't matter.

To make this more clear in the docs, we could add the following:

  • A note here mentioning that "token patterns" == Matcher patterns and "phrase patterns" == PhraseMatcher patterns.
  • A note that capitalisation doesn't matter.
  • We might as well make the attributes in the entity ruler patterns use uppercase attributes for consistency.

@ines ines added the feat / matcher Feature: Token, phrase and dependency matcher label Aug 1, 2019
@johann-petrak
Copy link
Author

Thank you!

Maybe I was also confused by doing things wrong when I tried this out.

One thing I noticed is this: is I initialise the Matcher with validate=True (oddly that parameter does not show up in the docstring for Matcher?) then using an attribute name "whitespace_" cuases an error:

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
d1 = nlp("This is some text. It has two sentences.")

m1 = Matcher(nlp.vocab, validate=True)
pattern = [{"whitespace_": " "},{"whitespace_": ""}]
m1.add("p1", None, pattern)

This shows messages:

spacy.errors.MatchPatternError: Invalid token patterns for matcher rule 'p1'

Pattern 0:
- Additional properties are not allowed ('whitespace_' was unexpected) [0]
- Additional properties are not allowed ('whitespace_' was unexpected) [1]

Originally I thought this indicates I cannot use the attribute name because the error is not shown only for the upper case versions, but apparently I can.

However:
If I try adding the pattern without validation, it gets accepted and then produces matches. However, I think those matches are wrong:

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
d1 = nlp("This is some text. It has two sentences.")

m1 = Matcher(nlp.vocab, validate=False)
pattern = [{"whitespace_": " "},{"whitespace_": ""}]
m1.add("p1", None, pattern)

print([(t.text, t.whitespace_) for t in d1])

matches = m1(d1)
for match in matches:
    print(d1[match[1]:match[2]])

As you can see all the consecutive token pairs are matched, but it should really just match those where the second token has whitespace set to the empty string which only occurs twice.

When I try some other rules using either "text" or "ORTH", things indeed do work as expected.

@johann-petrak
Copy link
Author

OK a bit more experimentation shows that if I disable validation, any name is accepted, but when I create a rule that contains non-existing attribute names, that rule always matches!

So it looks as if whitespace_ is not actualy an attribute name that is recognized by the Matcher and this in turn explains why the rule matches everywhere.

I find it odd that an unknown attribute matches everywhere for an equals condition since logically this should probably be equivalent to matching None, but everything (except None) equal-compares to None as False, not True!

@johann-petrak
Copy link
Author

Initially I relied on the informaiton on this page: https://spacy.io/usage/rule-based-matching

This has a table of the allowed attribute names (uppercase). I initially thought this is a limited list, but it actually says the table shows only the "most relevant" ones. But when trying to use the uppercased name for something not in this list, e.g. "WHITESPACE_" then withv "validate=True" the matcher will complain that the name "WHITESPACE_" is unexpected and with "validate=False" an incorrect result is produced.

Also, the table says that the extension attributes are supported, but not how: trying something like {"_.myattribute": someval} again gets refused with validate=True and produces wrong results with validate=False.

@ines
Copy link
Member

ines commented Aug 1, 2019

I think the main confusion you're running into here is that you're trying to include the underscore, which is not needed. If you look at the examples, you'll see that the underscore vs. non-underscore distinction doesn't matter here, because the patterns always specify the explicit string values, not the IDs (because that'd be pretty inconvenient). So your patterns will specify things like "POS": "NOUN" and not "POS_": "NOUN".

I need to check this but WHITESPACE might be one of the attributes that's currently not supported, because it's one of the special attributes that doesn't have an ID-equivalent and didn't seem that useful for matching. But I guess we could add it for consistency.

{"_.myattribute": someval} again gets refused with validate=True and produces wrong results with validate=False.

The table in the docs shows the top-level properties. So for underscore, that property is _. So your match pattern could look like this: {"_": {"myattribute": "value"}}.

I find it odd that an unknown attribute matches everywhere

The problem here is that the matcher will ignore unknown properties because it cannot match those token attributes. So {"XYZ": "something"} will be interpreted as {}. To avoid that, spaCy would have to always validate, which is inefficient. But maybe we can find some way to make it raise an error here that's less expensive.

@johann-petrak
Copy link
Author

Thanks that clears up many questions! I should probably have noticed that the trailing underscore is not just not necessary but harmful here, but maybe it would be a good idea to include this info in the docs for people like me?

I generally tend to expect such things to work according to the idea of 'least surprise' so if the rule can contain an attribute, why not simply use the original name, which is what most people would just quietly expect? Personally would I think that consistency would be more helpful here than the convenience of saving the underscore but if that is important I would at least also accept the original name because not doing it is really surprising.

When I use the validate=True parameter, neither 'pos' nor 'whitespace' without an underscore is accepted, both produce an error. I did not check yet if 'pos' works despite getting rejected, but something is weird here.

If I change the code to use validate=False and 'whitespace' without the underscore, the result is still wrong so it looks as if this attribute is not supported. I think it should be, as simply all attributes should be supported (to cause least surprise).

Thanks for pointing out how to use the '_' attribute correctly, this allows me to create a workaround where I first copy the whitespace attribute and then use it from there.

@honnibal
Copy link
Member

honnibal commented Aug 5, 2019

There's an implementation detail that's shining through here, and I guess you're right that it should be clearer.

We can't provide access to all of the attributes because the Matcher loops over the Cython data, not the Python objects. Inside the matcher we're dealing with a spacy.structs.TokenC struct -- we don't have an instance of spacy.tokens.token.Token. This means that all of the attributes that refer to computed properties can't be accessed.

These upper-case names refer to symbols from the spacy.attrs enum table. They're passed into a function that essentially is a big case/switch statement, to figure out which struct field to return. The same attribute identifiers are used in doc.to_array(), and a few other places in the code where you need to describe fields like this.

We do actually let you write the fields case insensitive, so if you write "lower" or "lemma" it will work. We do write the examples in upper-case though, because we want to be clear that the set of possibilities is different.

@ines ines closed this as completed in 223bde5 Aug 6, 2019
adrianeboyd added a commit to adrianeboyd/spaCy that referenced this issue Aug 12, 2019
Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses explosion#4070 (also related: explosion#4063, explosion#4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
polm pushed a commit to polm/spaCy that referenced this issue Aug 18, 2019
ines added a commit that referenced this issue Aug 21, 2019
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <[email protected]>
Co-authored-by: Matthew Honnibal <[email protected]>
@lock
Copy link

lock bot commented Sep 5, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 5, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
docs Documentation and website feat / matcher Feature: Token, phrase and dependency matcher
Projects
None yet
Development

No branches or pull requests

4 participants