Matcher and EntityRuler token patterns should use same language and use original Token attribute names #4063
I already had a look at the code about this and the EntityRuler internally creates a Matcher (see here) and possibly a PhraseMatcher (see here) as well. Moreover the |
Sorry if this was confusing. The names being uppercase was more of a convention that was introduced early on to make them resemble the symbol IDs and make them stand out a bit more as "special" in Python. But spaCy normalises them internally, so if you try it, you should see that capitalisation doesn't matter. To make this clearer in the docs, we could add the following:
Thank you! Maybe I was also confused by doing things wrong when I tried this out. One thing I noticed is this: if I initialise the Matcher with validate=True:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
d1 = nlp("This is some text. It has two sentences.")
m1 = Matcher(nlp.vocab, validate=True)
pattern = [{"whitespace_": " "}, {"whitespace_": ""}]
# spaCy v2 signature: add(key, on_match, *patterns)
m1.add("p1", None, pattern)
```

This shows messages:
Originally I thought this indicates I cannot use the attribute name because the error is not shown only for the upper case versions, but apparently I can. However:

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
d1 = nlp("This is some text. It has two sentences.")
m1 = Matcher(nlp.vocab, validate=False)
pattern = [{"whitespace_": " "}, {"whitespace_": ""}]
# spaCy v2 signature: add(key, on_match, *patterns)
m1.add("p1", None, pattern)
print([(t.text, t.whitespace_) for t in d1])
matches = m1(d1)
for match in matches:
    # Each match is a (match_id, start, end) tuple
    print(d1[match[1]:match[2]])
```

As you can see, all the consecutive token pairs are matched, but it should really just match those where the second token has whitespace set to the empty string, which only occurs twice. When I try some other rules using either "text" or "ORTH", things indeed do work as expected.
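For reference, the result the pattern was meant to produce can be worked out in plain Python, without spaCy. This is only a sketch: the `(text, whitespace)` pairs below are written out by hand and assume that the tokenizer splits the final period off each sentence, and `expected_pairs` is a hypothetical helper, not a spaCy function.

```python
# Hand-written (text, trailing whitespace) pairs for the example sentence.
# Assumption: this mirrors what the English tokenizer would produce.
tokens = [
    ("This", " "), ("is", " "), ("some", " "), ("text", ""), (".", " "),
    ("It", " "), ("has", " "), ("two", " "), ("sentences", ""), (".", ""),
]


def expected_pairs(tokens):
    """Return consecutive token pairs where the first token is followed
    by a single space and the second token by no whitespace at all."""
    pairs = []
    for (text1, ws1), (text2, ws2) in zip(tokens, tokens[1:]):
        if ws1 == " " and ws2 == "":
            pairs.append((text1, text2))
    return pairs


print(expected_pairs(tokens))
# [('some', 'text'), ('two', 'sentences')]
```

Under these assumptions only two of the nine consecutive pairs qualify, which is the "only occurs twice" result expected above.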
OK a bit more experimentation shows that if I disable validation, any name is accepted, but when I create a rule that contains non-existing attribute names, that rule always matches! So it looks as if I find it odd that an unknown attribute matches everywhere for an equals condition since logically this should probably be equivalent to matching None, but everything (except None) equal-compares to None as False, not True! |
Initially I relied on the information on this page: https://spacy.io/usage/rule-based-matching This has a table of the allowed attribute names (uppercase). I initially thought this was a limited list, but it actually says the table shows only the "most relevant" ones. But when trying to use the uppercased name for something not in this list, e.g. "WHITESPACE_", then with "validate=True" the matcher complains that the name "WHITESPACE_" is unexpected, and with "validate=False" an incorrect result is produced. Also, the table says that the extension attributes are supported, but not how: trying something like {"_.myattribute": someval} again gets refused with validate=True and produces wrong results with validate=False.
I think the main confusion you're running into here is that you're trying to include the underscore, which is not needed. If you look at the examples, you'll see that the underscore vs. non-underscore distinction doesn't matter here, because the patterns always specify the explicit string values, not the IDs (because that'd be pretty inconvenient). So your patterns will specify things like I need to check this but
The table in the docs shows the top-level properties. So for underscore, that property is
The problem here is that the matcher will ignore unknown properties because it cannot match those token attributes. So |
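The effect described here (unknown properties are ignored, so the rule matches everywhere) can be illustrated with a toy matcher in plain Python. This is a sketch under stated assumptions, not spaCy's implementation: `KNOWN_ATTRS` is an illustrative subset of attribute names, tokens are plain dicts, and `compile_spec`, `token_matches` and `find_matches` are hypothetical helpers.

```python
# Illustrative subset of attribute names; the real list is longer.
KNOWN_ATTRS = {"ORTH", "TEXT", "LOWER", "POS", "TAG", "DEP", "LEMMA", "SHAPE"}


def compile_spec(spec):
    """Upper-case the keys and silently drop any key that is not known."""
    return {
        key.upper(): value
        for key, value in spec.items()
        if key.upper() in KNOWN_ATTRS
    }


def token_matches(token, constraints):
    # With no constraints left, all() over an empty sequence is True,
    # so the token spec matches any token.
    return all(token.get(attr) == value for attr, value in constraints.items())


def find_matches(tokens, pattern):
    """Return (start, end) spans where every token spec matches in order."""
    compiled = [compile_spec(spec) for spec in pattern]
    width = len(compiled)
    return [
        (start, start + width)
        for start in range(len(tokens) - width + 1)
        if all(token_matches(tokens[start + i], compiled[i]) for i in range(width))
    ]


texts = ["This", "is", "some", "text", ".", "It", "has", "two", "sentences", "."]
tokens = [{"TEXT": t, "LOWER": t.lower()} for t in texts]

unknown = [{"whitespace_": " "}, {"whitespace_": ""}]
known = [{"lower": "some"}, {"TEXT": "text"}]

print(len(find_matches(tokens, unknown)))  # 9: every consecutive pair matches
print(find_matches(tokens, known))         # [(2, 4)]
```

In this model, dropping both unknown keys leaves two empty token specs, which behave like wildcards. That would explain why all consecutive pairs are returned, and the second pattern shows that lower-case and upper-case key names are treated the same.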
Thanks, that clears up many questions! I should probably have noticed that the trailing underscore is not just unnecessary but harmful here, but maybe it would be a good idea to include this info in the docs for people like me? I generally expect such things to work according to the principle of 'least surprise', so if the rule can contain an attribute, why not simply use the original name, which is what most people would quietly expect? Personally I would think that consistency is more helpful here than the convenience of saving the underscore, but if that is important, I would at least also accept the original name, because not doing so is really surprising. When I use the validate=True parameter, neither 'pos' nor 'whitespace' without an underscore is accepted; both produce an error. I have not yet checked whether 'pos' works despite being rejected, but something is weird here. If I change the code to use validate=False and 'whitespace' without the underscore, the result is still wrong, so it looks as if this attribute is not supported. I think it should be, as simply all attributes should be supported (to cause least surprise). Thanks for pointing out how to use the '_' attribute correctly; this allows me to create a workaround where I first copy the whitespace attribute and then use it from there.
There's an implementation detail that's shining through here, and I guess you're right that it should be clearer. We can't provide access to all of the attributes because the These upper-case names refer to symbols from the We do actually let you write the fields case insensitive, so if you write "lower" or "lemma" it will work. We do write the examples in upper-case though, because we want to be clear that the set of possibilities is different. |
* Fix typo in rule-based matching docs
* Improve token pattern checking without validation
  Add more detailed token pattern checks without full JSON pattern validation and provide more detailed error messages. Addresses #4070 (also related: #4063, #4100).
  * Check whether top-level attributes in patterns and attr for PhraseMatcher are in token pattern schema
  * Check whether attribute value types are supported in general (as opposed to per attribute with full validation)
  * Report various internal error types (OverflowError, AttributeError, KeyError) as ValueError with standard error messages
  * Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS, LEMMA, and DEP
  * Add error messages with relevant details on how to use validate=True or nlp() instead of nlp.make_doc()
  * Support attr=TEXT for PhraseMatcher
  * Add NORM to schema
  * Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler
* Remove unnecessary .keys()
* Rephrase error messages
* Add another type check to Matcher
  Add another type check to Matcher for more understandable error messages in some rare cases.
* Support phrase_matcher_attr=TEXT for EntityRuler
* Don't use spacy.errors in examples and bin scripts
* Fix error code
* Auto-format
  Also try get Azure pipelines to finally start a build :(
* Update errors.py

Co-authored-by: Ines Montani <[email protected]>
Co-authored-by: Matthew Honnibal <[email protected]>
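The first two checks listed in this commit message (top-level attribute names and general value types) could look roughly like the sketch below. It is an illustration only and not the code from the commit: `KNOWN_ATTRS` is an assumed subset of the schema, `check_pattern` is a hypothetical helper, and the error wording is invented.

```python
# Assumed, illustrative subset of the token pattern schema.
KNOWN_ATTRS = {
    "ORTH", "TEXT", "LOWER", "NORM", "LENGTH", "IS_ALPHA", "IS_DIGIT",
    "IS_PUNCT", "LIKE_NUM", "POS", "TAG", "DEP", "LEMMA", "SHAPE",
    "ENT_TYPE", "OP", "_",
}
# Value types accepted in general, not checked per attribute.
ALLOWED_VALUE_TYPES = (str, int, bool, dict)


def check_pattern(pattern):
    """Raise ValueError for unknown attribute names or unsupported value types."""
    for index, spec in enumerate(pattern):
        if not isinstance(spec, dict):
            raise ValueError(f"Token spec {index} must be a dict, got {type(spec).__name__}")
        for key, value in spec.items():
            # Names are compared case-insensitively, so "lower" equals "LOWER".
            if not isinstance(key, str) or key.upper() not in KNOWN_ATTRS:
                raise ValueError(f"Token spec {index}: unknown attribute {key!r}")
            if not isinstance(value, ALLOWED_VALUE_TYPES):
                raise ValueError(
                    f"Token spec {index}: unsupported value type "
                    f"{type(value).__name__} for attribute {key!r}"
                )


check_pattern([{"lower": "hello"}, {"POS": "NOUN"}])  # passes silently

try:
    check_pattern([{"whitespace_": " "}, {"whitespace_": ""}])
except ValueError as err:
    print(err)
```

With a check of this kind, the pattern from the original report fails early with a ValueError instead of silently matching every token pair.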
This is one of the most puzzling and least well documented details about spaCy:
Ideally, token matching in the EntityRuler would work identically to the Matcher, and both would allow all Token attributes to be used with their original names.