REGEX flag for the Matcher #1833

Closed
wants to merge 9 commits into from

savkov
Contributor

@savkov savkov commented Jan 12, 2018

Description

We've added a new feature: REGEX flag/attribute for the Matcher patterns. It allows the use of a regular expression to match a token in a pattern. Here's a simple example of a pattern including a regex attribute:

us_president_pattern = [
    {'LOWER': 'the'},
    {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'},
    {'LOWER': 'president'}
]

The REGEX flag should behave just like the other pattern flags: for example, it works in conjunction with operators.
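
To get a feel for what the pattern in the example accepts, the regular expression can be tried against single token strings with Python's standard `re` module. This is only a sketch: it uses a version of the pattern with balanced parentheses, the candidate strings are made up, and it assumes the REGEX attribute behaves like `re.match` on the token text, which the description does not spell out.

```python
import re

# Pattern from the example, with balanced parentheses.
# Assumption: the REGEX attribute is applied like re.match on the token text.
US_PATTERN = re.compile(r'^([Uu](\.?|nited) ?[Ss](\.?|tates))')

def matches_token(text):
    """Return True if the pattern matches at the start of the token text."""
    return US_PATTERN.match(text) is not None

# Made-up token strings for illustration.
candidates = ['US', 'U.S.', 'United States', 'president', 'used']
results = {text: matches_token(text) for text in candidates}
print(results)
```

Because the pattern has no trailing `$`, any token that merely starts with `us` (such as `used`) is accepted as well, so an end anchor may be worth adding in practice.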

Types of change

  • a new feature was added to the Matcher
  • some tests were added to support the feature
  • some tests were improved as the new feature exposes some of their weaknesses

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Hat tip to @jackatbancast for his help!

@savkov
Contributor Author

savkov commented Jan 12, 2018

😱 I forgot about Windows!

@ines ines added the enhancement Feature requests and improvements label Jan 12, 2018
@honnibal
Member

Interesting, thanks! Particularly like the use of the C-level regex library.

Question though. The regular expression is applied to the ORTH content, right? So why can't we do something like this:

import re

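def add_regex_flag(vocab, pattern_str):
    # Register a boolean flag on the vocab that is truthy when the regex matches
    flag_id = vocab.add_flag(re.compile(pattern_str).match)
    # The returned integer ID is what gets used as the key in Matcher patterns
    return flag_id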

Then you write something like this:

IS_REGEX_MATCH = add_regex_flag(vocab, '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))')

us_president_pattern = [
    {'LOWER': 'the'},
    {IS_REGEX_MATCH: True},
    {'LOWER': 'president'}
]

This applies the regex over the vocab, so it runs once per type, not once per token.
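
The difference between "once per type" and "once per token" can be illustrated without spaCy. The sketch below is a plain-Python stand-in, not how the vocab is actually implemented: a dictionary plays the role of the per-type flag, and the function name, token list, and simplified pattern are all invented for the illustration.

```python
import re

def count_regex_calls(tokens, pattern):
    """Compare regex evaluations per token with evaluations per distinct string."""
    regex = re.compile(pattern)
    cache = {}  # hypothetical stand-in for a flag stored once per vocabulary entry
    per_token_calls = 0
    per_type_calls = 0
    for text in tokens:
        # Per-token strategy: evaluate the regex for every occurrence.
        regex.match(text)
        per_token_calls += 1
        # Per-type strategy: evaluate once per distinct string, then reuse the result.
        if text not in cache:
            cache[text] = regex.match(text) is not None
            per_type_calls += 1
    return per_token_calls, per_type_calls, cache

tokens = ['the', 'US', 'president', 'and', 'the', 'US', 'senate']
per_token, per_type, flags = count_regex_calls(tokens, r'^[Uu]\.?[Ss]\.?$')
print(per_token, per_type)
```

On real text, where frequent words repeat many times, the gap between the two counts grows much larger than in this toy list.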

Now, this only works on single tokens, so you have to make sure United States is tokenized together. This is inconvenient, but the same is actually true of your patch, if I understand correctly?
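
The single-token limitation is easy to reproduce with the standard `re` module alone. In this sketch the token lists are written by hand (no tokenizer is run), the helper name is made up, and the pattern is the balanced form of the one used in the thread.

```python
import re

US_PATTERN = re.compile(r'^([Uu](\.?|nited) ?[Ss](\.?|tates))')

def matching_tokens(tokens):
    """Return the tokens that match the pattern on their own."""
    return [tok for tok in tokens if US_PATTERN.match(tok)]

# Hand-written tokenizations of the same phrase.
split_tokens = ['the', 'United', 'States', 'president']
merged_tokens = ['the', 'United States', 'president']

split_hits = matching_tokens(split_tokens)
merged_hits = matching_tokens(merged_tokens)
print(split_hits)
print(merged_hits)
```

When the phrase is split into `United` and `States`, neither token satisfies the expression by itself, so a token-level regex only fires if the two words were merged into one token beforehand.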

@ines
Member

ines commented Jan 14, 2018

Thanks for your work on this – it's definitely a feature that has been requested a lot, so it'd be great to make it happen!

Also linking the related issue #1567. It also includes a nice code snippet by @yarongon, which is the solution we'd currently recommend for regex-only matching:

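import re

NUM_PATTERN = re.compile(r"\d+")

# `doc` is assumed to be an existing spaCy Doc
for match in re.finditer(NUM_PATTERN, doc.text):
    start, end = match.span()
    print(f"The matched text: '{doc.text[start:end]}'")
    # Convert the character offsets into a Span of tokens
    span = doc.char_span(start, end)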
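
One thing to watch with this approach is that a regex match over the raw text does not have to line up with token boundaries. The following is a rough, spaCy-free sketch of that alignment step: `token_offsets` and `char_span_to_tokens` are hypothetical helpers, the tokenization is plain whitespace splitting, and returning `None` for an unaligned span is modelled on what `doc.char_span` is understood to do rather than copied from its implementation.

```python
import re

def token_offsets(text):
    """Whitespace tokenization: one (start, end) character span per token."""
    return [m.span() for m in re.finditer(r'\S+', text)]

def char_span_to_tokens(offsets, start, end):
    """Map a character span to (first_token, end_token_exclusive).

    Returns None if the span starts or ends in the middle of a token.
    """
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start in starts and end in ends:
        return starts[start], ends[end]
    return None

text = 'Call 911 or room42 now'
offsets = token_offsets(text)
aligned = [char_span_to_tokens(offsets, *m.span())
           for m in re.finditer(r'\d+', text)]
print(aligned)
```

Here `911` is a token of its own and maps cleanly, while the `42` inside `room42` cuts through a token and yields `None`, so the result of the conversion should be checked before it is used.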

@savkov
Contributor Author

savkov commented Jan 14, 2018

Hey @honnibal, thanks for showing this approach. I thought it wasn't possible because I couldn't find an example of it, but it was definitely my first thought (I didn't think of a flag as the integer returned when it is created). It is a reasonable and very spaCy way of doing it. Also, you are absolutely right about the tokenization.

If I can choose, I'd have both, because that would give me the most flexibility as a user. I would be able to optimise recurring patterns (e.g. something capturing hedge words) through a vocabulary flag, and handle specific patterns (e.g. typos for a particular word) through a regex attribute. I'm also a bit apprehensive about adding thousands of throwaway flags to the vocabulary: more complex systems, or systems that do matching at scale, would have to create a lot of those.

@ines thanks for that example. I didn't know about it and it's definitely something I'd use.

@honnibal
Member

You can't have thousands of flags -- the maximum is 64. The Matcher would perform terribly like that anyway.
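
One plausible reason for a hard limit like this is that boolean flags are packed into the bits of a single 64-bit integer per vocabulary entry. The thread does not say so explicitly, so treat that as an assumption. The `FlagRegistry` class below is a hypothetical illustration of such a scheme, not spaCy code.

```python
import re

class FlagRegistry:
    """Hypothetical registry that packs boolean flags into one 64-bit integer."""

    MAX_FLAGS = 64  # one bit per flag (assumed storage scheme)

    def __init__(self):
        self._getters = []

    def add_flag(self, getter):
        """Register a string -> bool function and return its flag ID."""
        if len(self._getters) >= self.MAX_FLAGS:
            raise ValueError('no free flag slots left')
        self._getters.append(getter)
        return len(self._getters) - 1

    def compute_bits(self, text):
        """Evaluate every registered getter once and pack the results into bits."""
        bits = 0
        for flag_id, getter in enumerate(self._getters):
            if getter(text):
                bits |= 1 << flag_id
        return bits

    @staticmethod
    def has_flag(bits, flag_id):
        return bool(bits & (1 << flag_id))

registry = FlagRegistry()
is_us = registry.add_flag(lambda text: re.match(r'^[Uu]\.?[Ss]\.?$', text) is not None)
bits = registry.compute_bits('U.S.')
us_flag_set = FlagRegistry.has_flag(bits, is_us)

# Fill the remaining slots, then try to register one flag too many.
for _ in range(FlagRegistry.MAX_FLAGS - 1):
    registry.add_flag(str.isdigit)
try:
    registry.add_flag(str.isalpha)
    overflowed = False
except ValueError:
    overflowed = True
print(us_flag_set, overflowed)
```

Under this scheme each regex would occupy one of the scarce bit slots permanently, which is why one flag per throwaway pattern does not scale.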

I do see the use for this, but I'm reluctant to have these two highly overlapping ways to achieve the same thing. It's also quite exceptional: everything else the Matcher patterns refer to are token or lexeme attributes --- this is a distinct mechanism.

@ohenrik
Contributor

ohenrik commented Feb 13, 2018

The regular expression is applied to the ORTH content, right?

@honnibal, it would be really nice if we could apply the regex to attributes other than ORTH. For example, I would like to find verbs in the infinitive by looking at the TAG, so I would like to match "VERB__inf" and "AUX__inf".
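
As a rough picture of what matching on an attribute other than the text could look like, here is a plain-Python sketch. The `Token` tuple, the `find_by_tag` helper, and every tag value other than "VERB__inf" and "AUX__inf" are invented for the example, and none of this reflects an existing Matcher API.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    """Hypothetical minimal token: surface text plus a fine-grained tag."""
    text: str
    tag: str

def find_by_tag(tokens, tag_pattern):
    """Return the texts of tokens whose tag fully matches the regex."""
    regex = re.compile(tag_pattern)
    return [tok.text for tok in tokens if regex.fullmatch(tok.tag)]

# Made-up tagging of "She will have to go".
tokens = [
    Token('She', 'PRON'),
    Token('will', 'AUX__fin'),
    Token('have', 'AUX__inf'),
    Token('to', 'PART'),
    Token('go', 'VERB__inf'),
]
infinitives = find_by_tag(tokens, r'(VERB|AUX)__inf')
print(infinitives)
```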

@honnibal
Member

Closing this, as improved matcher plans are now getting done in #1971 🎉

@honnibal honnibal closed this Feb 18, 2018