Add TokensRegex functionality to Spacy #1567

logisticDigressionSplitter · 2017-11-13T21:03:16Z

It would be very helpful to have TokensRegex functionality similar to the one in Stanford CoreNLP:
https://nlp.stanford.edu/software/tokensregex.html
This functionality provides immense flexibility for extracting "interesting" token sequences from text via token-level patterns.
Example: ([ner: PERSON]+) /was|is/ /an?/ []{0,3} /painter|artist/

Exact string match { word:"..." }
[ { word:"cat" } ] matches a token with text equal to "cat"
String regular expression match { word:/.../ }
[ { word:/cat|dog/ } ] matches a token with text "cat" or "dog"
Multiple attributes match { word:...; tag:... }
[ { word:/cat|dog/; tag:"NN" } ] matches a token with text "cat" or "dog" and POS tag is NN
Numeric expression match with ==, !=, >=, <=, >, <
[ { word>=4 } ] matches a token with text that has numeric value greater than or equal to 4.
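
To make the four kinds of token test above concrete, here is a minimal sketch in plain Python. It is neither TokensRegex nor spaCy: tokens are assumed to be plain dicts, the helper names `token_matches` and `find_matches` are hypothetical, and quantifiers such as `+` or `{0,3}` are left out, so only fixed-length patterns are handled.

```python
import re

def token_matches(token, spec):
    """Return True if every attribute test in spec holds for the token dict."""
    for attr, test in spec.items():
        value = token.get(attr)
        if value is None:
            return False
        if hasattr(test, "fullmatch"):       # compiled regex: { word:/.../ }
            if not test.fullmatch(value):
                return False
        elif callable(test):                 # numeric test: { word>=4 }
            try:
                if not test(float(value)):
                    return False
            except ValueError:               # token text is not a number
                return False
        elif value != test:                  # exact string: { word:"..." }
            return False
    return True

def find_matches(tokens, pattern):
    """Yield (start, end) token indices where consecutive tokens satisfy the pattern."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        if all(token_matches(tokens[start + i], pattern[i]) for i in range(n)):
            yield start, start + n

# Hand-written tokens for illustration (hypothetical, no tagger involved)
tokens = [
    {"word": "my", "tag": "PRP$"},
    {"word": "dog", "tag": "NN"},
    {"word": "is", "tag": "VBZ"},
    {"word": "7", "tag": "CD"},
]

exact = [{"word": "dog"}]                                # [ { word:"dog" } ]
regex = [{"word": re.compile(r"cat|dog")}]               # [ { word:/cat|dog/ } ]
multi = [{"word": re.compile(r"cat|dog"), "tag": "NN"}]  # [ { word:/cat|dog/; tag:"NN" } ]
numeric = [{"word": lambda x: x >= 4}]                   # [ { word>=4 } ]
sequence = [{"word": re.compile(r"cat|dog")}, {"word": "is"}]
```

Each pattern is a list with one spec per token, so `list(find_matches(tokens, sequence))` returns the span that covers "dog is".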

rulai-huajunzeng · 2017-11-14T07:45:44Z

Same requirement here. The SUTime parser was built on top of TokensRegex. If that could also be ported here, that would be wonderful.

ines · 2017-11-14T10:03:17Z

Thanks for the suggestion – this is definitely very relevant and something we've been thinking about for a while. The Matcher doesn't exactly use regular expressions, but it lets you express things in a very similar way – especially since you're able to set and use custom flags. So the easiest solution would be to simply translate the TokensRegex patterns into matcher rules:

[{ word: "cat" }] → [{'ORTH': 'cat'}]
[{ word:/cat|dog/ }] → [{IS_DOG_CAT: True}] (temporary flag created for pattern)
[{word:/cat|dog/; tag:"NN"}] → [{IS_DOG_CAT: True, 'TAG': 'NN'}]
[{ word>=4 }] → [{IS_LEN_GREATER_OR_EQUALS_4: True}] (temporary flag with getter, obviously not written out like this)

Btw, speaking of Matcher enhancements: We're also going to be porting over the dependency pattern matching algorithm implemented by @raphael0202 – see #1120. It's currently only available in v1.10, but the plan is to port it over to v2.x, cythonize it and move it to the matcher.pyx (where it should probably live now, considering there's both a Matcher and PhraseMatcher in v2.0). We also need to fix the matcher bug described in #1503.

damianoporta · 2017-11-17T23:41:22Z

@ines what exactly do you mean by "temporary flag"? A custom token extension with a getter?

ines · 2017-11-18T01:43:08Z

@damianoporta Basically, this would use the same logic as the Vocab.add_flag() method, which returns an integer ID between 1 and 63 and lets you assign a getter that takes the token text and returns a boolean value. For example:

IS_DOG_CAT = nlp.vocab.add_flag(lambda text: text in ['dog', 'cat'])

assert nlp("i have a cat")[3].check_flag(IS_DOG_CAT)  # check on token
pattern = [{IS_DOG_CAT: True}]  # use in matcher patterns

If you set a flag ID, it will be added to the lex_attr_getters, which also includes the other, built-in flags like IS_STOP etc. There are only 64 slots available though – and since the TokensRegex can include all kinds of arbitrary combinations, we obviously don't want to just create and keep all those random, arbitrary flags. That's why I called them "temporary": they should be created when we need them, and cleaned up again afterwards. So after the matcher is done, the flag IDs it set will have to be removed from the lex_attr_getters so they can be reused again later.
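
The create-then-clean-up lifecycle described above could look roughly like the following sketch. It is a toy model in plain Python, not spaCy's Vocab: the `FlagRegistry` class, its `remove_flag` method and the `temporary_flag` helper are all hypothetical names, and in the real library some of the slots are already taken by built-in flags.

```python
from contextlib import contextmanager

class FlagRegistry:
    """Toy stand-in for a fixed pool of boolean flag slots (not spaCy's Vocab)."""

    def __init__(self, first_id=1, last_id=63):
        # Stored in reverse so that pop() hands out the lowest free ID first
        self._free = list(range(last_id, first_id - 1, -1))
        self.getters = {}

    def add_flag(self, getter):
        if not self._free:
            raise ValueError("no free flag slots")
        flag_id = self._free.pop()
        self.getters[flag_id] = getter
        return flag_id

    def remove_flag(self, flag_id):
        # Hypothetical clean-up step: release the slot so it can be reused
        del self.getters[flag_id]
        self._free.append(flag_id)

    def check_flag(self, text, flag_id):
        return bool(self.getters[flag_id](text))

@contextmanager
def temporary_flag(registry, getter):
    """Register a flag for the duration of a with-block, then free its slot."""
    flag_id = registry.add_flag(getter)
    try:
        yield flag_id
    finally:
        registry.remove_flag(flag_id)

registry = FlagRegistry()
with temporary_flag(registry, lambda text: text in ("dog", "cat")) as IS_DOG_CAT:
    hits = [w for w in "i have a cat".split() if registry.check_flag(w, IS_DOG_CAT)]
```

A context manager is one way to guarantee the slot is released even if matching raises, which matters when only a few dozen slots exist.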

damianoporta · 2017-11-20T13:13:39Z

@ines is there a way to return another type of value instead of a bool?

ines · 2017-11-20T13:45:15Z

No, by design, flags can only return boolean values.

yarongon · 2018-01-10T12:32:50Z

I don't know if it was published anywhere, or if it's an official solution, but I found a simple and quite elegant way to match tokens with regular expressions using the Doc.char_span method. Here's a code example:

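import re

import spacy

# Only the tokenizer is needed here, so the other components are disabled
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
doc = nlp("This is a number: 5634. This is another number: 90.")

NUM_PATTERN = re.compile(r"\d+")

for match in NUM_PATTERN.finditer(doc.text):
    start, end = match.span()
    print(f"The matched text: '{doc.text[start:end]}'")
    span = doc.char_span(start, end)
    # char_span returns None if the offsets don't align with token boundaries
    if span is not None:
        # Now you have a Span object and you can do everything.
        print(span.text, span.start, span.end)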
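
One caveat with this approach is that a regex match does not always line up with token boundaries, in which case Doc.char_span returns None. The sketch below illustrates that alignment step in plain Python. It uses a deliberately simple regex tokenizer as a stand-in for spaCy's, and the helper names `tokenize_with_offsets` and `char_span_to_tokens` are hypothetical.

```python
import re

def tokenize_with_offsets(text):
    """Split into word runs and single punctuation marks, keeping char offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_tokens(tokens, start, end):
    """Map char offsets to (first_token, end_token_exclusive), or None if misaligned."""
    starts = {s: i for i, (_, s, _) in enumerate(tokens)}
    ends = {e: i + 1 for i, (_, _, e) in enumerate(tokens)}
    if start in starts and end in ends and starts[start] < ends[end]:
        return starts[start], ends[end]
    return None  # the offsets cut through the middle of a token

text = "This is a number: 5634. Order id: A90."
tokens = tokenize_with_offsets(text)

# "5634" is a token of its own; "90" is only part of the token "A90"
results = {m.group(): char_span_to_tokens(tokens, *m.span())
           for m in re.finditer(r"\d+", text)}
```

Here the match on "5634" maps to a one-token span, while the match on "90" maps to None because it starts inside "A90". Code using the regex approach should handle that case explicitly.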

ines · 2018-01-14T12:30:37Z

@yarongon Thanks for sharing your code – and yes, this is definitely recommended usage. And I agree, this should be mentioned in the docs. Maybe we should have a subsection of "Rule-based matching" that shows examples of using regular expressions. It's not exactly related to the Matcher, but I think that section is where people would most likely be looking.

You might also be interested in #1833, which discusses solutions for integrating regular expressions with the rule-based matcher. We're hoping to add something like this to spaCy soon!

…skip]

ines · 2018-02-12T11:11:52Z

Merging this with the master issue #1971!

lock · 2018-05-07T23:55:05Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the enhancement and help wanted labels Nov 14, 2017

yarongon mentioned this issue Jan 10, 2018

[Feature request] Adding regex patterns to rule based matching #882

Closed

ines mentioned this issue Jan 14, 2018

REGEX flag for the Matcher #1833

Closed

3 tasks

ines added a commit that referenced this issue Jan 14, 2018

Add regex section to rule-based matching docs (see #1567, #1833) [ci …

4daba3a

…skip]

ines mentioned this issue Feb 12, 2018

💫 Better, faster and more customisable matcher #1971

Closed

5 tasks

ines closed this as completed Feb 12, 2018

lock bot locked as resolved and limited conversation to collaborators May 7, 2018
