# 💫 Better, faster and more customisable matcher #1971
Being able to serialize the PhraseMatcher would be great.
Thank you! This will be awesome! I basically need all these new features! Do you have any estimates on when this will be available to try out?
A further improvement on this might be to enable running a matcher after other matchers, i.e. to first recognize a pattern, add that pattern as an attribute, and then run a matcher on that attribute again. Example: … convert it to: … and then recognize this pattern: …

This might also simplify pattern creation, because you can generalize, match simple patterns, and reuse them in later patterns. For example, I want all "adj + verb" and "verb + adj" patterns to be matched and marked as "ACTION", then use this in another matcher later. Then match something like this: … where the key "MATCH" refers to previous matches and ACTION is the previous match ID. Lastly, this would also solve issues related to "or" patterns for patterns that depend on two or more tokens, since you can then combine the new set feature (`IN`/`NOT_IN`) with previously matched patterns. This would allow for much more complex types of patterns. I hope you will consider this extension, and I'll be happy to contribute to developing it. I think all that is needed is to set a hierarchy variable on match patterns, and then run the matchers in each hierarchy level one after another, storing previous matches as attributes.

EDIT: I realized after writing this that this can be done using the new logic for custom attributes (`_`), if I'm not mistaken: by adding matches to a custom variable on the doc object and then running the second set of matchers after this (see the sketch below). However, it requires quite a lot of custom logic.
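A minimal sketch of that EDIT idea, assuming the spaCy v2 `Matcher.add` API; the extension name, label and example text are all made up:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# First pass: find "adj + verb" / "verb + adj" sequences and store them
# on the Doc under a custom attribute.
Doc.set_extension('actions', default=None)

first_pass = Matcher(nlp.vocab)
first_pass.add('ACTION', None,
               [{'POS': 'ADJ'}, {'POS': 'VERB'}],
               [{'POS': 'VERB'}, {'POS': 'ADJ'}])

doc = nlp("The paint looked fresh")
doc._.actions = [doc[start:end] for _, start, end in first_pass(doc)]

# Second pass: any later matcher or component can now consult
# doc._.actions instead of re-deriving the low-level patterns.
print(doc._.actions)
```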
@ohenrik isn't this already possible through an …
An alternative to the hierarchical matching patterns would be to allow sub-patterns directly in the patterns. At the moment every rule is for one token, like this: … But by adding a sub-list, you could actually create a rule set that can use all the other operators in combination:

```
pattern = [
    [token_rule, token_rule, token_rule],
    [token_rule, token_rule, token_rule],
]
```

Meaning this would be possible: …
@savkov Yes, I realized it will be possible, but it's a bit of a hassle to get working. I think having nested patterns would solve this better, as I proposed above. The example might not be that clear, though.
@ohenrik @savkov @GregDubbin @thomasopsomer Now we need tests with non-trivial grammars. Benchmarks would be particularly useful. I'm hoping the new matcher is faster, but it's hard to guess. If you have a non-trivial grammar handy, could you try it out and see how it looks? Once we're confident the reimplementation is correct, we can push a v2.0.8 that makes the replacement, to fix #1945. Then we can implement the fancy new pattern logic for v2.1.0 🎉
Hm, so far the PhraseMatcher seems much slower. Running more tests.
@honnibal I can test it out too, if you still need someone to test it for speed. I don't have code using the PhraseMatcher, but I can test most of the other old functionality against the new.
@ohenrik Yes please. I just pushed an update that fixes the speed problems on my PhraseMatcher benchmark, using the data from https://github.com/mpuig/spacy-lookup. Taking the first 1000 documents of IMDB train, and using the 400k patterns in …
Once the doc is tokenized, using the …

I think the difference in the number of matches between the PhraseMatcher and spacy-lookup comes from the handling of overlaps. I think it's nice to be able to return all matches, even if the domains of two patterns overlap. This is different from how regex engines work, but I think it's better semantics for what we want here. If I have patterns for "fresh food" and "food delivery", we do want matches for both if the input text is "fresh food delivery".
@honnibal I agree: in most of the use cases I have and foresee, I would want to include overlapping matches :) A switch for including overlapping matches or not would be a bonus, though. Related to the PhraseMatcher: is it possible to add patterns to it, i.e. not just match exact words in sequence, but match token TAG or POS sequences too?
@ohenrik The PhraseMatcher exploits the fact that the … The trick fundamentally relies on the fact that the lexemes are indexed by … How about this: we could build a temporary `Doc` from the lemmas:

```python
lemmas = Doc(doc.vocab, words=[token.lemma_ for token in doc])
```

Then we can pass this to the PhraseMatcher. (I wish I'd thought of this solution earlier --- it would've worked all along!)
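Spelled out, the trick might look like this; a sketch assuming the v2.x `PhraseMatcher` API, with a made-up pattern and example text:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

matcher = PhraseMatcher(nlp.vocab)
# The pattern is itself a Doc, here built directly from lemma strings.
matcher.add('LIKE_CAT', None, Doc(nlp.vocab, words=['like', 'cat']))

doc = nlp("She liked cats")
# Temporary Doc whose words are the lemmas of the original tokens.
lemmas = Doc(doc.vocab, words=[token.lemma_ for token in doc])
# Match on the lemma Doc; indices line up token-for-token with `doc`.
spans = [doc[start:end] for match_id, start, end in matcher(lemmas)]
```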
@honnibal Nice explanation and solution! 🎉 Thank you! 🙂
Now that the Matcher is returning all matches for quantifiers, we can remove the length limit in the PhraseMatcher 🎉. Previously, if we had a pattern like …

This change currently makes the phrase-matcher a bit less efficient: speed drops from 2m tokens per second to 1.3m. This is still much faster than the tokenizer, and we can recover the missing speed if necessary by having more specific patterns for more lengths.
Suggested usage / user-facing API for now:

```python
matcher = PhraseMatcher(nlp.vocab, attribute='POS')
matcher.add('PATTERN', None, nlp(u"I like cats"))  # kinda useless example, but you get the idea
matches = matcher(nlp(u"You love dogs"))
assert len(matches) == 1
```
Found another one, related to merging of …
When do you plan to release the next nightly? |
@mr-bjerre Thanks for the thorough testing! I've added an xfailing test for your first example.

The second problem you describe is more of a problem with merging overlapping spans. After that first loop, the indices of the next span have changed, because parts of it have been merged into the previous token already. You can see this if you print the token texts for each span in the loop. There's not really an easy answer for this, so spaCy should raise an error, which it can't do with …

However, if you're using the new `doc.retokenize()` context manager:

```python
from spacy.tokens import Span

doc = nlp('“Net sales and result reached record levels for a second quarter.“')
spans = [Span(doc, 1, 3), Span(doc, 2, 3)]
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)
```
Yes, I realize that was the problem. However, I thought that was the reason to collect the spans in a list in the first place in this example. I suppose I will use …

EDIT: I suppose that won't work, since it's a bulk change. You are saying there are no easy workarounds?
Hi @ines, this code yields matches of the MYLABEL pattern. However, it behaves differently when you comment out line … Do you have an idea what could cause this behaviour?
@mr-bjerre Well, there's just not an easy answer for how this should be handled by default. If your tokens are A, B and C and your merging specifies A + B = AB and B + C = BC, that's a state that's not achievable. One token ABC is one possible solution – but if you need that, you'll have to specify the merging like that explicitly. Each span exposes its start and end token index, so you can write your own filter function that checks whether spans overlap and, if so, only chooses the longest span for merging. (That's mostly just an algorithmic question and not really specific to spaCy or NLP.)

@paulrinckens Thanks for the report and for testing the nightly! 👍 If you have time, do you think you'd be able to break your example down into a simpler function with the minimum needed to reproduce the problem? It's a bit hard to follow with all the scaffolding code and programmatically generated extensions.

Edit: Okay, I think I came up with something similar: see 3af0b2d.
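For reference, such a filter could look like this: a sketch of the keep-the-longest-span strategy (current spaCy versions ship a similar helper as `spacy.util.filter_spans`):

```python
def filter_overlapping_spans(spans):
    """Keep longest spans; drop any span that overlaps an already-kept one."""
    seen_tokens = set()
    result = []
    # Longest spans first, so overlaps are resolved in their favour.
    for span in sorted(spans, key=lambda s: s.end - s.start, reverse=True):
        if not any(i in seen_tokens for i in range(span.start, span.end)):
            result.append(span)
            seen_tokens.update(range(span.start, span.end))
    return sorted(result, key=lambda s: s.start)
```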
@ines Yes, this seems like a similar issue. However, the issue I was describing occurs when referring to multiple extension attributes on one individual token in the matcher pattern. Here is a more reduced test: …
@paulrinckens Thanks! While writing the test, I actually uncovered a memory error (leading to a segmentation fault), so there's definitely something wrong under the hood here. I suspect there might only be one or two small issues that end up causing all of the reported problems here, since they're all quite similar.
Hello @ines, I think I have a similar issue to the one @paulrinckens describes here, and similar to one I had a few days ago:

```python
pipeline = spacy.load('de', disable=['ner'])
# …
```
When I run this code, the output is the following: …

And when I exclude Extension_2 and use the text "Es knallt und und und", I get the following output: …

It seems like the last "und" is lost or not recognized. Do you have an idea what could be wrong?
Another cool feature would be to add additional operators like …
It would also be very cool to allow for an optional sequence of …

EDIT: It might be better to have a pipeline merging …
What are the differences in performance? Is it better to FIRST find a new custom attribute and THEN use that in later matches, or is it better to find the matches in one go (using more patterns)?
Sorry for the spam today, but an …

At the moment you have to add a whole new pattern, so that increases exponentially when you have a lot of those cases.

EDIT: I suppose you could add …
Yes, I guess that's what I would have suggested. There's definitely a point at which patterns become very difficult to read, validate and maintain – so even if what you describe was possible, you'd probably want to break the logic into two steps: first, retokenize the document so that the tokens match your definition of "words" or "logical units", then match based on these units. Performance-wise, matching a single token based on a simple equality comparison of one attribute is probably better than matching a sequence of tokens with multiple operators. But it likely doesn't matter that much. If you really want the ultimate efficiency, you should encode all your custom attributes as binary flags – but again, it might not actually be noticeable or worth it.
The norm could, for example, be customised with a getter:

```python
def get_custom_norm(token):
    return my_custom_norm_dict.get(token.text, token.norm_)
```
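If I'm reading that right, wiring it up and matching on it might look like this; a sketch where `my_custom_norm_dict`, the extension name and the pattern are all illustrative:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.blank('en')

my_custom_norm_dict = {'colour': 'color'}  # illustrative mapping

def get_custom_norm(token):
    return my_custom_norm_dict.get(token.text, token.norm_)

Token.set_extension('custom_norm', getter=get_custom_norm)

matcher = Matcher(nlp.vocab)
# Refer to the extension via the `_` space of the token pattern.
matcher.add('CUSTOM_NORM', None, [{'_': {'custom_norm': 'color'}}])

doc = nlp("I like this colour")
matches = matcher(doc)  # should match the token "colour"
```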
That was clever with the getter! But there seems to be a bug; it might be related to the other bugs.
I have …

Of course I could also just have …

but how is that performance-wise compared to the …
I agree with you, @ines. I made a function that takes care of the overlapping spans, so I get ABC as I intend to. I suppose this shouldn't throw a …

I get …
@mr-bjerre Yes, this seems to be the same retokenization issue as #3288 (also see my regression test that shows the underlying problem). I think the other problem you reported looks like it's related to the other matcher issue as well. Matching on custom attributes isn't reliable yet in the current nightly. (We already tracked this down, though.)
Alright, thanks for the update. Tracked down, meaning solved? What about the issue with the `?` operator? That one is causing me a lot of trouble; let me know if I can be of any help.
* Fix matching on extension attrs and predicates
* Fix detection of `match_id` when using extension attributes. The match ID is stored as the last entry in the pattern. We were checking for this with `nr_attr == 0`, which didn't account for extension attributes.
* Fix handling of predicates. The wrong count was being passed through, so even patterns that didn't have a predicate were being checked.
* Fix regex pattern
* Fix matcher set value test
Btw. @ines, is it correct that …

yields the error …
That should be fixed in v2.1 --- if you try …
Hi @ines, I think I found another bug in the new Matcher. When using regex on custom attributes together with the `OP` attribute, I get matches that I did not expect. See this code snippet:

```python
import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.load('de', disable=['ner'])
doc = nlp("Das ist Text")

Token.set_extension("a", default="False")
doc[0]._.set("a", "x")
doc[1]._.set("a", "y")

matcher = Matcher(nlp.vocab)
pattern = [{'_': {'a': {'REGEX': 'x'}}}, {'_': {'a': {'REGEX': 'y'}}, 'OP': '*'}]
matcher.add("MYLABEL", None, pattern)

# Expect matches "Das" and "Das ist"
assert len(matcher(doc)) == 2
```

However, the matches obtained from the Matcher are: … Do you have an idea what could have caused this last, obviously incorrect match "Text"?

EDIT: This seems to be fixed in 2.1.0a10.
spaCy version = 2.0.12. Some unexpected behavior with the Matcher:

Issue 1: …

Issue 2: …

I tried updating to the latest version of spaCy (and then to …
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Related issues: #1567, #1711, #1819, #1939, #1945, #1951, #2042
We're currently in the process of rewriting the match loop, fixing long-standing issues and making it easier to extend the `Matcher` and `PhraseMatcher`. The community contributions by @GregDubbin and @savkov have already made a big difference – we can't wait to get it all ready and shipped.

This issue discusses some of the planned new features and additions to the match patterns API, including matching by custom extension attributes (`Token._.`), regular expressions, set membership and rich comparison for numeric values.

## New features
### Custom extension attributes
spaCy v2.0 introduced custom extension attributes on the `Doc`, `Span` and `Token`. Custom attributes make it easier to attach arbitrary data to the built-in objects, and let users take advantage of spaCy's data structures and the `Doc` object as the "single source of truth". However, not being able to match on custom attributes was quite limiting (see #1499, #1825).

The new patterns spec will allow an `_` space on token patterns, which can map to a dictionary keyed by the attribute names:
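For example, something along these lines (the `is_fruit` extension is illustrative, not from the original post):

```python
from spacy.tokens import Token

# Hypothetical property extension with a getter:
Token.set_extension('is_fruit', getter=lambda token: token.text in ('apple', 'banana'))

# Token pattern referring to the extension via the `_` space:
pattern = [{'LEMMA': 'have'}, {'_': {'is_fruit': True}}]
```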
Both regular attribute extensions (with a default value) and property extensions (with a getter) will be supported and can be combined for more exact matches.
### Rich comparison for numeric values
Token patterns already allow specifying a `LENGTH` (the token's character length). However, matching tokens of between five and ten characters previously required adding 6 copies of the exact same pattern, introducing unnecessary overhead. Numeric attributes can now also specify a dictionary with the predicate (e.g. `'>'` or `'<='`) mapped to the value. For example:
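A sketch of two such patterns, reconstructed from the description rather than the original snippets:

```python
# A token between five and ten characters long:
pattern1 = [{'LENGTH': {'>=': 5, '<=': 10}}]

# A token with entity type ORG that's 5 or more characters long:
pattern2 = [{'ENT_TYPE': 'ORG', 'LENGTH': {'>=': 5}}]
```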
The second pattern above will match a token with the entity type `ORG` that's 5 or more characters long. Combined with custom attributes, this allows very powerful queries combining both linguistic features and numeric data:
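For instance, a sketch with a made-up numeric extension attribute:

```python
# Match "Facebook" followed by a token whose (hypothetical) custom
# numeric attribute `number_of_likes` is at least 100:
pattern = [{'LOWER': 'facebook'}, {'_': {'number_of_likes': {'>=': 100}}}]
```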
Defining predicates and values as a dictionary instead of a single string like `'>=5'` allows us to avoid string parsing, and lets spaCy handle custom attributes without requiring the user to specify their types upfront. (While we know the type of the built-in `LENGTH` attribute, spaCy has no way of knowing whether the value `'<3'` of a custom attribute should be interpreted as "less than 3", or the heart emoticon.)

### Set membership
This is another feature that has been requested before and will now be much easier to implement. Similar to the predicate mapping for numeric values, token attributes can now also be defined as dictionaries. The keys `IN` or `NOT_IN` can be used to indicate set membership and non-membership. The pattern sketched below will match a token with the lemma "like" or "love", followed by a token whose lowercase form is either "apples" or "bananas" (for example, "loving apples" or "likes bananas"). Lists can be used for all non-boolean values, including custom `_` attributes:
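The first pattern below is reconstructed from the description above; the custom-attribute example uses a made-up extension name:

```python
# Matches e.g. "loving apples" or "likes bananas":
pattern = [
    {'LEMMA': {'IN': ['like', 'love']}},
    {'LOWER': {'IN': ['apples', 'bananas']}},
]

# Lists also work for custom extension attributes:
pattern_custom = [{'_': {'fruit_color': {'IN': ['red', 'yellow']}}}]
```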
### Regular expressions
Using regular expressions within token patterns is already possible via custom binary flags (see #1567). However, this has some inconvenient limitations – including the patterns not being JSON-serializable. If the solution is to add binary flags, spaCy might as well take care of that. The following example is based on the work by @savkov (see #1833): …

Using `'REGEX'` as an operator (instead of a top-level property that only matches on the token's text) allows defining rules for any string value, including custom attributes:
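Illustrative pattern shapes (the `country` extension is made up):

```python
# REGEX as an operator on a built-in string attribute:
pattern = [{'TAG': {'REGEX': '^V'}}]

# ...and on a custom extension attribute:
pattern_custom = [{'_': {'country': {'REGEX': '^[Uu]nited [Ss]tates$'}}}]
```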
## New operators

TL;DR: The new patterns spec will allow two ways of defining properties – attribute values for exact matches and dictionaries using operators for more fine-grained matches.
The following operators can be used within dictionaries describing attribute values:

| Operator | Example |
| --- | --- |
| `==`, `>=`, `<=`, `>`, `<` | `'LENGTH': {'>': 10}` |
| `IN` | `'LEMMA': {'IN': ['like', 'love']}` |
| `NOT_IN` | `'POS': {'NOT_IN': ['NOUN', 'PROPN']}` |
| `REGEX` | `'TAG': {'REGEX': '^V'}` |
## API improvements and bug fixes

See @honnibal's comments in #1945 and the `feature/better-faster-matcher` branch for more details and implementation examples.

Other fixes:
- … the `PhraseMatcher`.
- … the `PhraseMatcher`.
- `Matcher.pipe` should yield matches instead of `Doc` objects.
- … the `PhraseMatcher`.
- `"TEXT"` as an alternative to `"ORTH"` (for consistency).