Can't SHAPE match against strings of digits that have leading 0's. #4615
Comments
Another weird case:
produces
Note how |
I thought this would be a bug, but it looks like it's the intended behavior. The current shape definition truncates any run of more than four consecutive characters of the same type to four in the output:
Lines 150 to 174 in 4d85f67
So it's not related to leading zeros. This isn't obvious behavior and it should be documented properly in the |
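The capping rule described in this comment can be illustrated with a minimal sketch. The function name `word_shape_sketch` is hypothetical, and this is not the code at the referenced lines, only an approximation of the behavior discussed in this thread: uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters are kept as-is, and a run of the same shape character is cut off after four.

```python
def word_shape_sketch(text: str) -> str:
    """Approximate a word-shape feature with runs capped at four.

    Hypothetical helper for illustration only; the library's real
    implementation may differ in details.
    """
    shape = []
    last = ""      # shape character of the previous input character
    run = 0        # how many times `last` has repeated so far
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation etc. is kept unchanged
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Only the first four characters of a run are emitted.
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)


print(word_shape_sketch("01234"))   # five digits collapse to "dddd"
print(word_shape_sketch("1234"))    # four digits stay "dddd"
print(word_shape_sketch("12-345"))  # separate runs are not merged: "dd-ddd"
```

Under this sketch, a five-digit token and a four-digit token share the shape `dddd`, so a pattern asking for `ddddd` can never match, with or without leading zeros.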
Sorry for the lack of docs, this has always been the way word shape features have behaved, to reduce sparsity. I took this definition from some paper, I think either the Ratinov et al. (2009) one, or possibly an early paper from Stanford.
Note that the German phone number example (2nd code example below https://spacy.io/usage/rule-based-matching#example2 ) shouldn't work since it uses |
A full description would be something like:
The `SHAPE` matching against tokens comprised of digits that have leading zeros, e.g. `01234`, doesn't work correctly.
What I'd expect is that the pattern `[{'SHAPE':'ddddd'}]` would match against the token `01234`; this is not the case. Another way to say it is that you can have a token with `text=="01234"` whose `shape=="ddddd"`, but does not match that pattern.
How to reproduce the behaviour
Your Environment