Add TokensRegex functionality to Spacy #1567

Closed
logisticDigressionSplitter opened this issue Nov 13, 2017 · 10 comments
Labels
enhancement Feature requests and improvements help wanted Contributions welcome!

Comments

@logisticDigressionSplitter

It would be very helpful to have TokensRegex functionality similar to the one in Stanford CoreNLP:
https://nlp.stanford.edu/software/tokensregex.html
It provides immense flexibility for extracting "interesting" spans from text via token-level patterns.
Example: ([ner: PERSON]+) /was|is/ /an?/ []{0,3} /painter|artist/

Exact string match { word:"..." }
[ { word:"cat" } ] matches a token with text equal to "cat"
String regular expression match { word:/.../ }
[ { word:/cat|dog/ } ] matches a token with text "cat" or "dog"
Multiple attributes match { word:...; tag:... }
[ { word:/cat|dog/; tag:"NN" } ] matches a token with text "cat" or "dog" whose POS tag is NN
Numeric expression match with ==, !=, >=, <=, >, <
[ { word>=4 } ] matches a token with text that has numeric value greater than or equal to 4.
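To make the four pattern types above concrete, here is a minimal pure-Python sketch of token-level matching. It is not TokensRegex or spaCy code: tokens are plain dicts, and the helper names (`exact`, `regex`, `all_of`, `numeric_ge`, `find_matches`) are made up for illustration. It only handles fixed-length patterns, with no quantifiers or capture groups.

```python
import re

def exact(attr, value):
    """Token predicate: attribute equals value, like { word:"cat" }."""
    return lambda tok: tok.get(attr) == value

def regex(attr, pattern):
    """Token predicate: the whole attribute matches a regex, like { word:/cat|dog/ }."""
    compiled = re.compile(pattern)
    return lambda tok: compiled.fullmatch(tok.get(attr, "")) is not None

def all_of(*preds):
    """Token predicate: every sub-predicate holds, like { word:...; tag:... }."""
    return lambda tok: all(p(tok) for p in preds)

def numeric_ge(attr, threshold):
    """Token predicate: attribute parses as a number >= threshold, like { word>=4 }."""
    def check(tok):
        try:
            return float(tok.get(attr, "")) >= threshold
        except ValueError:
            return False  # non-numeric text never matches
    return check

def find_matches(tokens, pattern):
    """Return (start, end) token index pairs where each predicate matches one token, in order."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(pred(tok) for pred, tok in zip(pattern, tokens[i:i + n]))
    ]

# Hypothetical pre-tagged tokens, standing in for real tokenizer/tagger output
tokens = [
    {"word": "I", "tag": "PRP"},
    {"word": "saw", "tag": "VBD"},
    {"word": "5", "tag": "CD"},
    {"word": "dog", "tag": "NN"},
]
cat_or_dog_noun = [all_of(regex("word", r"cat|dog"), exact("tag", "NN"))]
number_then_animal = [numeric_ge("word", 4), regex("word", r"cat|dog")]
noun_matches = find_matches(tokens, cat_or_dog_noun)
number_matches = find_matches(tokens, number_then_animal)
```

Each predicate inspects a single token, and a pattern is just a list of predicates, which is roughly the shape both TokensRegex and the spaCy Matcher expose.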

@rulai-huajunzeng

Same requirement here. The SUTime parser was built on top of TokensRegex. If that also can be migrated here that will be wonderful.

@ines ines added enhancement Feature requests and improvements help wanted Contributions welcome! labels Nov 14, 2017
@ines
Member

ines commented Nov 14, 2017

Thanks for the suggestion – this is definitely very relevant and something we've been thinking about for a while. The Matcher doesn't exactly use regular expressions, but it lets you express things in a very similar way – especially since you're able to set and use custom flags. So the easiest solution would be to simply translate the TokensRegex patterns into matcher rules:

  • [{ word: "cat" }] → [{'ORTH': 'cat'}]
  • [{ word:/cat|dog/ }] → [{IS_DOG_CAT: True}] (temporary flag created for pattern)
  • [{word:/cat|dog/; tag:"NN"}] → [{IS_DOG_CAT: True, 'TAG': 'NN'}]
  • [{ word>=4 }] → [{IS_LEN_GREATER_OR_EQUALS_4: True}] (temporary flag with getter, obviously not written out like this)

Btw, speaking of Matcher enhancements: We're also going to be porting over the dependency pattern matching algorithm implemented by @raphael0202 – see #1120. It's currently only available in v1.10, but the plan is to port it over to v2.x, cythonize it and move it to the matcher.pyx (where it should probably live now, considering there's both a Matcher and PhraseMatcher in v2.0). We also need to fix the matcher bug described in #1503.

@damianoporta

@ines what exactly do you mean by "temporary flag"? A custom token extension with a getter?

@ines
Member

ines commented Nov 18, 2017

@damianoporta Basically, this would use the same logic as the Vocab.add_flag() method (source here), which returns an integer ID between 1 and 63 and lets you assign a getter that takes the token text and returns a boolean value. For example:

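import spacy

nlp = spacy.load('en')  # same spaCy v2 model shortcut used elsewhere in this thread
# Register a flag whose getter receives the token text and returns a bool
IS_DOG_CAT = nlp.vocab.add_flag(lambda text: text in ['dog', 'cat'])

assert nlp("i have a cat")[3].check_flag(IS_DOG_CAT)  # check on token
pattern = [{IS_DOG_CAT: True}]  # use in matcher patterns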

If you set a flag ID, it will be added to the lex_attr_getters, which also includes the other, built-in flags like IS_STOP etc. There are only 64 slots available though – and since the TokensRegex can include all kinds of arbitrary combinations, we obviously don't want to just create and keep all those random, arbitrary flags. That's why I called them "temporary": they should be created when we need them, and cleaned up again afterwards. So after the matcher is done, the flag IDs it set will have to be removed from the lex_attr_getters so they can be reused again later.
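The create-and-clean-up lifecycle described above can be sketched without spaCy. The `FlagSlots` class and `temporary_flag` helper below are hypothetical and only model the bookkeeping: IDs 1 to 63 are handed out on demand and released once matching is done. How spaCy itself would free a slot in `lex_attr_getters` is not shown here.

```python
from contextlib import contextmanager

class FlagSlots:
    """Hypothetical registry modelling a fixed pool of flag IDs (1..63)."""

    def __init__(self, first_id=1, last_id=63):
        # Stored in descending order so that pop() starts from the lowest ID
        self._free = list(range(last_id, first_id - 1, -1))
        self.getters = {}  # flag ID -> getter(text) -> bool

    def add_flag(self, getter):
        if not self._free:
            raise ValueError("no free flag slots left")
        flag_id = self._free.pop()
        self.getters[flag_id] = getter
        return flag_id

    def remove_flag(self, flag_id):
        del self.getters[flag_id]
        self._free.append(flag_id)  # the slot can be reused by the next add_flag call

    def check_flag(self, flag_id, text):
        return bool(self.getters[flag_id](text))

@contextmanager
def temporary_flag(slots, getter):
    """Register a flag for the duration of a with-block, then release its slot."""
    flag_id = slots.add_flag(getter)
    try:
        yield flag_id
    finally:
        slots.remove_flag(flag_id)

slots = FlagSlots()
with temporary_flag(slots, lambda text: text in ("dog", "cat")) as IS_DOG_CAT:
    hits = [w for w in "i have a cat".split() if slots.check_flag(IS_DOG_CAT, w)]
slots_in_use_after = len(slots.getters)
```

Wrapping the flag in a context manager means the slot is released even if matching raises, so a long-running process would not slowly exhaust the pool.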

@damianoporta

@ines Is there a way to return another type of value instead of a bool?

@ines
Member

ines commented Nov 20, 2017

No, by design, flags can only return boolean values.

@yarongon

I don't know if it was published anywhere, or if it's an official solution, but I found a simple and quite elegant way to match tokens with regular expressions using the Doc.char_span method (link to API). Here's a code example:

import spacy
import re
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
doc = nlp("This is a number: 5634. This is another number: 90.")

NUM_PATTERN = re.compile(r"\d+")

for match in NUM_PATTERN.finditer(doc.text):
    start, end = match.span()
    print(f"The matched text: '{doc.text[start:end]}'")
    # char_span returns None if the offsets don't align with token boundaries
    span = doc.char_span(start, end)
    # Now you have a Span object and you can do everything.
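One caveat worth keeping in mind: `Doc.char_span` returns `None` when the character offsets do not line up with token boundaries, so a regex match that starts or ends in the middle of a token produces no span. The following is a rough pure-Python sketch of that alignment check, not spaCy's implementation. It assumes whitespace tokenization, and `token_offsets` and `char_span_to_tokens` are hypothetical helper names.

```python
import re

def token_offsets(text):
    """Whitespace tokenization: one (start, end) character offset pair per token."""
    return [m.span() for m in re.finditer(r"\S+", text)]

def char_span_to_tokens(offsets, start, end):
    """Map a character range to a (first, last_exclusive) token range, or None if misaligned."""
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i for i, (_, e) in enumerate(offsets)}
    if start not in starts or end not in ends:
        return None  # the range begins or ends inside a token
    if starts[start] > ends[end]:
        return None  # the range is reversed
    return starts[start], ends[end] + 1

text = "Order 5634 shipped, order 90x pending"
offsets = token_offsets(text)
results = {
    m.group(): char_span_to_tokens(offsets, *m.span())
    for m in re.finditer(r"\d+", text)
}
```

Here "5634" is a whole token and maps to a token range, while "90" is only part of the token "90x" and maps to `None`. With real spaCy code the same situation calls for a `None` check on the result of `doc.char_span` before using the span.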

@ines
Member

ines commented Jan 14, 2018

@yarongon Thanks for sharing your code – and yes, this is definitely recommended usage. And I agree, this should be mentioned in the docs. Maybe we should have a subsection of "Rule-based matching" that shows examples of using regular expressions. It's not exactly related to the Matcher, but I think that section is where people would most likely be looking.

You might also be interested in #1833, which discusses solutions for integrating regular expressions with the rule-based matcher. We're hoping to add something like this to spaCy soon!

@ines
Member

ines commented Feb 12, 2018

Merging this with the master issue #1971!

@ines ines closed this as completed Feb 12, 2018
@lock

lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 7, 2018