REGEX flag for the Matcher #1833
Regex attribute for Matcher patterns
😱 I forgot about windows! |
Interesting, thanks! Particularly like the use of the C-level regex library. Question though. The regular expression is applied to the

    import re

    def add_regex_flag(vocab, pattern_str):
        # Register the compiled regex's match method as a boolean vocab flag.
        flag_id = vocab.add_flag(re.compile(pattern_str).match)
        return flag_id

Then you write something like this:

    IS_REGEX_MATCH = add_regex_flag(vocab, '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))')
    us_president_pattern = [
        {'LOWER': 'the'},
        {IS_REGEX_MATCH: True},
        {'LOWER': 'president'}
    ]

This applies the regex over the vocab, so it runs once per type, not once per token. Now, this only works on single tokens, so you have to make sure |
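The "once per type, not once per token" point can be illustrated without spaCy. The following is a minimal sketch using only the standard library; `flag_types` is a hypothetical helper, not part of spaCy, and it only mimics the idea of evaluating a regex once per distinct token string and reusing the result.

```python
import re

# Same expression as in the comment above, written as a raw string.
US_PATTERN = re.compile(r"^([Uu](\.?|nited) ?[Ss](\.?|tates))")


def flag_types(tokens, pattern):
    """Evaluate the regex once per distinct token string (type).

    Returns the per-type boolean flags and the number of regex calls made.
    """
    flags = {}
    calls = 0
    for text in tokens:
        if text not in flags:
            flags[text] = pattern.match(text) is not None
            calls += 1
    return flags, calls


# Naive whitespace tokenization, purely for illustration.
tokens = "the U.S. president met the US president".split()
flags, calls = flag_types(tokens, US_PATTERN)
print(len(tokens), calls)
```

Here seven tokens need only five regex evaluations, because "the" and "president" each occur twice. On a large corpus the gap between tokens and types is much wider, which is the saving the flag approach relies on.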
Thanks for your work on this – it's definitely a feature that has been requested a lot, so it'd be great to make it happen! Also linking the related issue #1567. It also includes a nice code snippet by @yarongon, which is the solution we'd currently recommend for regex-only matching:

    import re

    NUM_PATTERN = re.compile(r"\d+")
    # doc is an existing spaCy Doc
    for match in re.finditer(NUM_PATTERN, doc.text):
        start, end = match.span()
        print(f"The matched text: '{doc.text[start:end]}'")
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(start, end)
|
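One caveat with matching on the raw text is that a regex match does not have to line up with token boundaries, in which case `doc.char_span` returns `None`. The sketch below shows that alignment check with the standard library only; `tokenize` and `char_span_to_tokens` are hypothetical stand-ins for spaCy's tokenizer and `char_span`, using naive whitespace tokenization.

```python
import re


def tokenize(text):
    """Whitespace tokenizer returning (start, end) character offsets per token."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_span_to_tokens(offsets, start, end):
    """Map a character span to (first_token, end_token_exclusive).

    Returns None when the span does not start and end on token boundaries.
    """
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start in starts and end in ends and starts[start] < ends[end]:
        return starts[start], ends[end]
    return None


text = "Call 555 1234 or room 42b today"
offsets = tokenize(text)
results = [
    (m.group(), char_span_to_tokens(offsets, *m.span()))
    for m in re.finditer(r"\d+", text)
]
print(results)
```

The matches "555" and "1234" map cleanly to single tokens, while "42" is only part of the token "42b" and therefore has no token-level span.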
Hey @honnibal, thanks for showing this approach. I thought it wasn't possible because I couldn't find an example of it but it was definitely my first thought (I didn't think of the flags as the integer returned after creation). It is a reasonable and very spaCy way of doing it. Also, you are absolutely right about the tokenization.

If I can choose, I'd have both because that would enable me most as a user. I will be able to optimise recurring patterns (e.g. something capturing hedge words) through a vocabulary flag and specific patterns through a regex attribute (e.g. handling typos for a particular word).

I'm also a bit apprehensive about adding thousands of throwaway flags to the vocabulary. More complex systems or systems that do matching at scale would have to create a lot of those.

@ines thanks for that example. I didn't know about it and it's definitely something I'd use. |
You can't have thousands of flags -- the maximum is 64. The I do see the use for this, but I'm reluctant to have these two highly overlapping ways to achieve the same thing. It's also quite exceptional: everything else the |
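A likely reason for a hard cap like this is that boolean flags are packed into a single 64-bit integer, one bit per flag. The sketch below illustrates that idea with a hypothetical `FlagRegistry` class; it is not spaCy's implementation, and in a real system some of the bits would already be taken by built-in flags.

```python
class FlagRegistry:
    """Hypothetical registry that packs boolean flags into one 64-bit integer."""

    MAX_FLAGS = 64  # one bit per flag in a 64-bit field

    def __init__(self):
        self._getters = []

    def add_flag(self, getter):
        """Register a predicate and return its bit position (the flag ID)."""
        if len(self._getters) >= self.MAX_FLAGS:
            raise ValueError("no free flag bits left")
        self._getters.append(getter)
        return len(self._getters) - 1

    def compute(self, text):
        """Evaluate every registered predicate and pack the results into an int."""
        bits = 0
        for flag_id, getter in enumerate(self._getters):
            if getter(text):
                bits |= 1 << flag_id
        return bits


def check_flag(bits, flag_id):
    """Return True if the bit for flag_id is set."""
    return bool((bits >> flag_id) & 1)


registry = FlagRegistry()
IS_DIGITS = registry.add_flag(str.isdigit)
IS_UPPER = registry.add_flag(str.isupper)
bits = registry.compute("USA")
print(check_flag(bits, IS_UPPER), check_flag(bits, IS_DIGITS))
```

Under this layout every throwaway regex flag permanently consumes one of the 64 bits, which is why creating one flag per pattern does not scale.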
@honnibal. It would be really nice if we could apply regex to things other than ORTH. For example, I would like to find verbs in the infinitive by looking at the TAG, so I would like to match "VERB__inf" and "AUX__inf". |
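What this request amounts to can be sketched with plain dictionaries: each pattern entry maps an attribute name to either an exact string or a compiled regex. Everything here is hypothetical (`token_matches`, `find_matches`, and the tag values, which reuse the strings from the comment above); it does not show the syntax proposed in this PR or spaCy's actual `Matcher`.

```python
import re


def token_matches(token, spec):
    """Check one token (a dict of attributes) against one pattern entry."""
    for attr, expected in spec.items():
        value = token.get(attr, "")
        if isinstance(expected, re.Pattern):
            # Regex constraint: applies to whichever attribute it is attached to.
            if expected.search(value) is None:
                return False
        elif value != expected:
            return False
    return True


def find_matches(tokens, pattern):
    """Return (start, end) token spans where the whole pattern matches in sequence."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(token_matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


# Hypothetical tagged tokens; the tag strings are illustrative only.
tokens = [
    {"ORTH": "Je", "TAG": "PRON"},
    {"ORTH": "veux", "TAG": "VERB__fin"},
    {"ORTH": "être", "TAG": "AUX__inf"},
    {"ORTH": "libre", "TAG": "ADJ"},
]
infinitive = [{"TAG": re.compile(r"^(VERB|AUX)__inf$")}]
matches = find_matches(tokens, infinitive)
print(matches)
```

Only the third token carries an infinitive tag, so the single match is the span covering "être".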
Closing this, as improved matcher plans are now getting done in #1971 🎉 |
Description
We've added a new feature: REGEX flag/attribute for the Matcher patterns. It allows the use of a regular expression to match a token in a pattern. Here's a simple example of a pattern including a regex attribute:
The REGEX flag should function just as the rest of the pattern flags: it works in conjunction with operators, etc.
Types of change
Checklist
Hat tip to @jackatbancast for his help!