vocab["-0.23"].like_num is False #2782
Comments
Yes, I am also facing the same issue, but with Token objects: negative numbers are tagged as PUNCT. Did you find any workaround for this?
I checked the like_num feature for you: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lex_attrs.py As far as I can see, a minus sign in front is not parsed. @ines, what do you say?
FYI, a leading "+" is also not parsed. Maybe both should just be included?
Sure, I meant that no "sign bit" is included in parsing 😉 We can skip the initial sign character during the parse.
I already tried the like_num attribute on the Token class, but it is not working. Currently I am doing this:
This would be good to handle, yes. I'd propose the following:

```python
def like_num(text):
    if text.startswith('+') or text.startswith('-'):
        text = text[1:]
    # rest of the function
```
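Filled out with the rest of the usual English `like_num` logic (digit check, simple fractions, and a number-word lookup), a self-contained sketch of this proposal might look like the following; the `_num_words` list here is abbreviated for illustration, and the real one in `lex_attrs.py` is longer:

```python
_num_words = ['zero', 'one', 'two', 'three', 'four', 'five',
              'six', 'seven', 'eight', 'nine', 'ten']  # abbreviated

def like_num(text):
    # Skip a leading sign so "-0.23" and "+7" are handled.
    if text.startswith('+') or text.startswith('-'):
        text = text[1:]
    # Strip digit-grouping and decimal separators: "1,000.5" -> "10005".
    stripped = text.replace(',', '').replace('.', '')
    if stripped.isdigit():
        return True
    # Simple fractions like "3/4".
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words
```

With this change, `like_num("-0.23")` and `like_num("+7")` both return `True`, while the unsigned behavior is unchanged.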
Btw @ines, startswith can take a tuple as an argument. When I first found out about it, I instantly fell in love ❤️ So this is possible:
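The tuple form being referred to is presumably along these lines, collapsing the two prefix checks into one call:

```python
text = "-0.23"
# str.startswith accepts a tuple of prefixes,
# so both signs are covered in a single check.
if text.startswith(('+', '-')):
    text = text[1:]
print(text)  # prints "0.23"
```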
Ahh, nice! That's even better. I'm just writing some tests for this for all languages that implement like_num.
What about possibly also handling "~5", etc.? Also, "±1"? I know we're getting into a grey area here, and this isn't very high priority, but those kinds of cases might also be interesting. Of course, at some point it's probably up to the user to just handle this, or use a regex for the use case. "+" and "-" are definitely a good improvement in the general case, though.
From my side, why not... additions can go in a similar fashion. Then most probably we'd like to do:
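Extending the same prefix tuple to the speculative cases from the previous comment would look something like this; whether "±" and "~" belong in the core check is exactly the open question above:

```python
def like_num(text):
    # Strip a leading sign or approximation marker before the number checks.
    # '±' and '~' are speculative additions, per the discussion above.
    if text.startswith(('+', '-', '±', '~')):
        text = text[1:]
    # Remaining checks simplified to a digit test for illustration.
    return text.replace(',', '').replace('.', '').isdigit()
```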
Sure! Just tested it and it seems to work as expected – it was only a case of adding more characters to the existing check. I also noticed that the tokenizer was currently always splitting these characters off.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Pretty straightforward: it seems like negative numbers are currently not flagged as numbers in the Lexeme object.