REGEX flag for the Matcher #1833

savkov · 2018-01-12T14:45:43Z

Description

We've added a new feature: a REGEX flag/attribute for Matcher patterns. It lets a regular expression match a single token in a pattern. Here's a simple example of a pattern that includes a regex attribute:

us_president_pattern = [
    {'LOWER': 'the'},
    {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'},
    {'LOWER': 'president'}
]

The REGEX flag should behave just like the other pattern attributes: for example, it works in conjunction with operators.
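To make it concrete which single-token strings a regex like the one above would accept, here is a small sketch using only Python's `re` module, independent of the Matcher. It assumes the outer group of the pattern is meant to be closed, and the candidate strings are made up for illustration.

```python
import re

# Pattern from the example above, written as a raw string and with the
# outer group closed (an assumption about the intended pattern).
US_PATTERN = re.compile(r'^([Uu](\.?|nited) ?[Ss](\.?|tates))')

# Hypothetical token texts, chosen only for illustration.
candidates = ['US', 'U.S.', 'us', 'United States', 'UnitedStates',
              'Usher', 'UK', 'states']

# re.match anchors at the start of the string but not at the end.
matched = [text for text in candidates if US_PATTERN.match(text)]
print(matched)
```

Because the pattern has no `$` anchor, a string such as 'Usher' is accepted as well; adding `$` at the end would restrict it to whole-token matches.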

Types of change

a new feature was added to the Matcher
some tests were added to support the feature
some tests were improved as the new feature exposes some of their weaknesses

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

Hat tip to @jackatbancast for his help!

savkov · 2018-01-12T15:01:09Z

😱 I forgot about Windows!

honnibal · 2018-01-14T12:16:35Z

Interesting, thanks! Particularly like the use of the C-level regex library.

Question though. The regular expression is applied to the ORTH content, right? So why can't we do something like this:

import re

def add_regex_flag(vocab, pattern_str):
    flag_id = vocab.add_flag(re.compile(pattern_str).match)
    return flag_id

Then you write something like this:

IS_REGEX_MATCH = add_regex_flag(vocab, '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))')

us_president_pattern = [
    {'LOWER': 'the'},
    {IS_REGEX_MATCH: True},
    {'LOWER': 'president'}
]

This applies the regex over the vocab, so it runs once per type, not once per token.

Now, this only works on single tokens, so you have to make sure United States is tokenized together. This is inconvenient, but the same is actually true of your patch, if I understand correctly?
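The "once per type, not once per token" point can be sketched in plain Python. This is only an analogy, not how spaCy stores flags internally: the `make_cached_flag` helper, the pattern, and the sample sentence are all hypothetical. The regex result is cached per distinct string, so repeated tokens never trigger a second regex call.

```python
import re
from collections import Counter


def make_cached_flag(pattern_str):
    """Return (flag, calls): flag(text) evaluates the regex once per distinct string."""
    regex = re.compile(pattern_str)
    cache = {}
    calls = Counter()

    def flag(text):
        if text not in cache:
            calls['regex'] += 1  # count actual regex evaluations
            cache[text] = regex.match(text) is not None
        return cache[text]

    return flag, calls


# Whitespace splitting stands in for a real tokenizer here.
tokens = 'the US president met the US senate and the U.S. press'.split()
is_us, calls = make_cached_flag(r'^[Uu]\.?[Ss]\.?$')

hits = [tok for tok in tokens if is_us(tok)]
print(hits, len(tokens), calls['regex'])
```

The sentence has 11 tokens but only 8 distinct strings, so the regex runs 8 times. On a large corpus the gap between token count and type count is much wider.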

ines · 2018-01-14T12:26:38Z

Thanks for your work on this – it's definitely a feature that has been requested a lot, so it'd be great to make it happen!

Also linking the related issue #1567. It also includes a nice code snippet by @yarongon, which is the solution we'd currently recommend for regex-only matching:

import re

NUM_PATTERN = re.compile(r"\d+")

for match in re.finditer(NUM_PATTERN, doc.text):
    start, end = match.span()
    print(f"The matched text: '{doc.text[start:end]}'")
    # char_span returns None if the match does not align with token boundaries
    span = doc.char_span(start, end)

…skip]
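One thing to watch with the character-offset approach is that a regex match need not line up with token boundaries, in which case `doc.char_span` returns `None`. The sketch below imitates that check without spaCy. The whitespace tokenizer and the helper names `token_offsets` and `char_span_to_tokens` are simplifying assumptions, not part of any library.

```python
import re


def token_offsets(text):
    """Whitespace tokenisation with (start, end) character offsets (a simplification)."""
    return [m.span() for m in re.finditer(r'\S+', text)]


def char_span_to_tokens(offsets, start, end):
    """Map a character span to (first_token, end_token_exclusive), or None if misaligned."""
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i for i, (_, e) in enumerate(offsets)}
    if start in starts and end in ends and starts[start] <= ends[end]:
        return starts[start], ends[end] + 1
    return None


text = 'Flight 370 left at 10pm'
offsets = token_offsets(text)

results = []
for m in re.finditer(r'\d+', text):
    results.append((m.group(), char_span_to_tokens(offsets, *m.span())))
print(results)
```

Here '370' is a whole token and maps to a token span, while the '10' inside '10pm' ends mid-token and maps to `None`, so such matches need to be skipped or handled separately.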

savkov · 2018-01-14T16:04:44Z

Hey @honnibal, thanks for showing this approach. It was definitely my first thought, but I assumed it wasn't possible because I couldn't find an example of it (I didn't think of the flags as the integer returned after creation). It is a reasonable and very spaCy way of doing it. Also, you are absolutely right about the tokenization.

If I can choose, I'd have both, because that would give me the most flexibility as a user. I would be able to optimise recurring patterns (e.g. something capturing hedge words) through a vocabulary flag, and handle specific patterns (e.g. typos of a particular word) through a regex attribute. I'm also a bit apprehensive about adding thousands of throwaway flags to the vocabulary. More complex systems, or systems that do matching at scale, would have to create a lot of those.

@ines thanks for that example. I didn't know about it and it's definitely something I'd use.

honnibal · 2018-01-14T19:32:32Z

You can't have thousands of flags -- the maximum is 64. The Matcher would perform terribly like that anyway.
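The limit of 64 is what you would expect if boolean flags are stored as individual bits of a 64-bit integer per vocabulary entry. The sketch below illustrates that kind of bitfield; the `FlagRegistry` class and helper functions are hypothetical and not spaCy's actual code.

```python
MAX_FLAGS = 64  # one bit per flag in a 64-bit integer (assumed layout)


class FlagRegistry:
    """Hands out bit positions until the 64-bit field is exhausted."""

    def __init__(self):
        self.next_bit = 0

    def add_flag(self):
        if self.next_bit >= MAX_FLAGS:
            raise ValueError('no free flag bits left')
        bit = self.next_bit
        self.next_bit += 1
        return bit


def set_flag(flags, bit, value):
    """Return flags with the given bit set or cleared."""
    return flags | (1 << bit) if value else flags & ~(1 << bit)


def check_flag(flags, bit):
    """Return True if the given bit is set in flags."""
    return bool(flags & (1 << bit))


registry = FlagRegistry()
bits = [registry.add_flag() for _ in range(MAX_FLAGS)]

word_flags = set_flag(0, bits[3], True)
print(check_flag(word_flags, bits[3]), check_flag(word_flags, bits[4]))

try:
    registry.add_flag()
    overflowed = False
except ValueError:
    overflowed = True
```

In practice some of those bits are likely to be taken by built-in flags already, so even fewer would be free for custom regex flags.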

I do see the use for this, but I'm reluctant to have these two highly overlapping ways to achieve the same thing. It's also quite exceptional: everything else the Matcher patterns refer to are token or lexeme attributes --- this is a distinct mechanism.

ohenrik · 2018-02-13T11:21:59Z

The regular expression is applied to the ORTH content, right?

@honnibal, it would be really nice if we could apply regex to things other than ORTH. For example, I would like to find verbs in the infinitive by looking at the TAG, so I would like to match "VERB__inf" and "AUX__inf".
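As a rough sketch of what that request amounts to, the snippet below applies a regex to a tag string instead of the token text. It uses plain `(text, tag)` tuples rather than spaCy tokens, and the tag values are made up to follow the "VERB__inf" style mentioned above.

```python
import re

# Matches the two tag values mentioned above.
INFINITIVE = re.compile(r'^(VERB|AUX)__inf$')

# Hypothetical (text, tag) pairs; real tags depend on the model used.
tagged_tokens = [
    ('want', 'VERB__pres'),
    ('to', 'PART'),
    ('be', 'AUX__inf'),
    ('able', 'ADJ'),
    ('to', 'PART'),
    ('read', 'VERB__inf'),
]

infinitives = [text for text, tag in tagged_tokens if INFINITIVE.match(tag)]
print(infinitives)
```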

honnibal · 2018-02-18T13:04:07Z

Closing this, as improved matcher plans are now getting done in #1971 🎉

savkov and others added 9 commits January 10, 2018 15:42

Added regex operator to the matcher

4b0f499

Added tests for the REGEX attribute

10e797d

Improved the algorithm and the tests

0a3f45d

Removed unnecessary setting of token value when regex is used

6e6c135

Removed lines

889c0a2

Removed more unnecessary lines

5bb254b

Reverted a comment

e7abe66

Optimised the matcher tests

6630e1a

Merge pull request #1 from Babylonpartners/regex_matcher

cd2e5d7

Regex attribute for Matcher patterns

ines added the enhancement Feature requests and improvements label Jan 12, 2018

ines mentioned this pull request Jan 14, 2018

Add TokensRegex functionality to Spacy #1567

Closed

ines added a commit that referenced this pull request Jan 14, 2018

Add regex section to rule-based matching docs (see #1567, #1833) [ci …

4daba3a

…skip]

ines mentioned this pull request Feb 12, 2018

💫 Better, faster and more customisable matcher #1971

Closed

5 tasks

honnibal closed this Feb 18, 2018
