Add TokensRegex functionality to Spacy #1567
Comments
Same requirement here. The SUTime parser was built on top of TokensRegex. If that could also be migrated here, that would be wonderful.
Thanks for the suggestion – this is definitely very relevant and something we've been thinking about for a while. The
Btw, speaking of |
@ines What exactly do you mean by "temporary flag"? A custom token extension with a getter?
@damianoporta Basically, this would use the same logic as the
IS_DOG_CAT = nlp.vocab.add_flag(lambda text: text in ['dog', 'cat'])
assert nlp("i have a cat")[3].check_flag(IS_DOG_CAT) == True  # check on token
pattern = [{IS_DOG_CAT: True}]  # use in matcher patterns
If you set a flag ID, it will be added to the
@ines Is there a way to return another type of value instead of a bool?
No, by design, flags can only return boolean values.
I don't know if it was published anywhere, or if it's an official solution, but I found a simple and quite elegant solution for matching tokens with regular expressions using the
import spacy
import re
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
doc = nlp("This is a number: 5634. This is another number: 90.")
NUM_PATTERN = re.compile(r"\d+")
for match in re.finditer(NUM_PATTERN, doc.text):
    start, end = match.span()
    print(f"The matched text: '{doc.text[start:end]}'")
    # char_span returns None if the offsets do not align with token boundaries.
    span = doc.char_span(start, end)
    # Now you have a Span object and you can do everything.
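One caveat with the regex approach above is alignment: a regex works on characters, while a span is made of whole tokens, so a match that starts or ends inside a token has no corresponding span. The following is a minimal pure-Python sketch of that mapping, not spaCy's implementation. The tokenizer regex and the helper names `tokenize` and `char_span_to_tokens` are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical stand-in for a tokenizer: runs of word characters, or single
# punctuation marks. A real tokenizer is more involved than this.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return one (start, end, text) triple per token."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]


def char_span_to_tokens(tokens, start, end):
    """Map character offsets to a (first, last_exclusive) token index pair.

    Returns None when the offsets do not fall exactly on token boundaries.
    """
    starts = {tok_start: i for i, (tok_start, _, _) in enumerate(tokens)}
    ends = {tok_end: i + 1 for i, (_, tok_end, _) in enumerate(tokens)}
    if start in starts and end in ends and starts[start] < ends[end]:
        return starts[start], ends[end]
    return None


text = "This is a number: 5634. This is another number: 90."
tokens = tokenize(text)

# r"\d+" always covers whole number tokens, so every match maps to a token range.
aligned = [char_span_to_tokens(tokens, *m.span()) for m in re.finditer(r"\d+", text)]

# r"\d\d" splits "5634" into "56" and "34", which both cut through a token.
partial = [char_span_to_tokens(tokens, *m.span()) for m in re.finditer(r"\d\d", text)]

print(aligned)
print(partial)
```

In practice this means checking the result for None before using it whenever the regex can match part of a token.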
@yarongon Thanks for sharing your code – and yes, this is definitely recommended usage. And I agree, this should be mentioned in the docs. Maybe we should have a subsection of "Rule-based matching" that shows examples of using regular expressions. It's not exactly related to the
You might also be interested in #1833, which discusses solutions for integrating regular expressions with the rule-based matcher. We're hoping to add something like this to spaCy soon!
Merging this with the master issue #1971!
It would be very helpful to have TokensRegex functionality similar to the one in StanfordNLP:
https://nlp.stanford.edu/software/tokensregex.html
This functionality provides a lot of flexibility for extracting "interesting" spans from text via token-level patterns.
Example: ([ner: PERSON]+) /was|is/ /an?/ []{0,3} /painter|artist/
Exact string match { word:"..." }
[ { word:"cat" } ] matches a token with text equal to "cat"
String regular expression match { word:/.../ }
[ { word:/cat|dog/ } ] matches a token with text "cat" or "dog"
Multiple attributes match { word:...; tag:... }
[ { word:/cat|dog/; tag:"NN" } ] matches a token with text "cat" or "dog" and POS tag NN
Numeric expression match with ==, !=, >=, <=, >, <
[ { word>=4 } ] matches a token with text that has numeric value greater than or equal to 4.
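To make the four match types above concrete, here is a rough pure-Python sketch of how such token-level patterns could be evaluated. It is neither TokensRegex nor spaCy code: tokens are plain dicts, and the names `attr_matches`, `token_matches` and `find_matches` are hypothetical. It assumes regular expressions must match the whole token text (as the "cat" or "dog" description suggests) and leaves out quantifiers such as []{0,3}.

```python
import operator
import re

NUMERIC_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}


def attr_matches(value, condition):
    """Check one token attribute against one condition."""
    if isinstance(condition, re.Pattern):   # like { word:/.../ }
        return condition.fullmatch(value) is not None
    if isinstance(condition, tuple):        # like { word>=4 }
        op, number = condition
        try:
            return NUMERIC_OPS[op](float(value), number)
        except ValueError:                  # token text is not numeric
            return False
    return value == condition               # like { word:"..." }


def token_matches(token, spec):
    """A token matches when every attribute condition in the spec holds."""
    return all(attr_matches(token.get(attr, ""), cond) for attr, cond in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) token index pairs where the whole pattern matches."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(token_matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


tokens = [
    {"word": "My", "tag": "PRP$"},
    {"word": "cat", "tag": "NN"},
    {"word": "chased", "tag": "VBD"},
    {"word": "7", "tag": "CD"},
    {"word": "dogs", "tag": "NNS"},
]

exact = find_matches(tokens, [{"word": "cat"}])
regex = find_matches(tokens, [{"word": re.compile(r"cat|dog")}])
multi = find_matches(tokens, [{"word": re.compile(r"cat|dog"), "tag": "NN"}])
numeric = find_matches(tokens, [{"word": (">=", 4)}])
sequence = find_matches(tokens, [{"tag": "VBD"}, {"word": (">=", 4)}])
```

Note that "dogs" is not matched by /cat|dog/ here because of the whole-token assumption; switching `fullmatch` to `search` would change that.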