Unexpected behaviour of tokenizer.add_special_case method
#2728
Comments
Sorry for only getting to this now. I fixed the first example in the documentation and a bug that was actually related to displaCy and how the HTML was escaped or not escaped. And yes, setting entity types via the tokenizer exceptions isn't supported because this can easily lead to confusing results and only really works for special case rules where the token text contains the full entity. For example,
spaCy should fail more gracefully now, but the underlying problem here is likely an internal mismatch of attributes, because spaCy doesn't expect the entity tags to be overwritten in the tokenizer. So you probably want to do this kind of stuff in a separate step using the rule-based
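For anyone landing here later, a minimal sketch of the "separate step" idea: the special case only keeps the tag together as one token, and the entity is assigned afterwards on the Doc. This is not taken from the thread. It assumes spaCy v2.1 or later (where Span accepts a string label), and the sample sentence and the RELEASE label are made up for illustration.

```python
import spacy
from spacy.symbols import ORTH
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no statistical NER (assumption for the sketch).
nlp = spacy.blank("en")

# The special case carries only ORTH, so the tokenizer just keeps the tag whole.
nlp.tokenizer.add_special_case("<RELEASE>", [{ORTH: "<RELEASE>"}])

doc = nlp("Version <RELEASE> is out")  # illustrative sentence

# Separate step: mark every <RELEASE> token as a one-token entity.
# "RELEASE" is a hypothetical label chosen for this example.
doc.ents = [
    Span(doc, token.i, token.i + 1, label="RELEASE")
    for token in doc
    if token.text == "<RELEASE>"
]

tokens = [token.text for token in doc]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(tokens)
print(entities)
```

Assigning doc.ents in one place keeps the entity attributes consistent, rather than setting ENT_TYPE and ENT_IOB per token inside the tokenizer exception.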
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
I want to add a special case to preserve the custom tag as a single token.
Problem no.1: Documentation example raises TypeError
Problem no.2: Token is missing from entity rendering
NOTE: Working in Jupyter notebook
Note that the <RELEASE> part is completely missing.
Problem no.3: Explicitly specifying custom ENT_TYPE and ENT_IOB causes Jupyter kernel to crash
I haven't found any documentation regarding this, so I just tried adding the following to the tokenizer's add_special_case method.
After issuing doc.ents or trying to access a token's ent_type_ or ent_iob_, the Jupyter kernel crashes.
Problem no.4: Explicitly assign entity to special case
Has no effect: the token's ent_type_ is still '' and ent_iob_ is 'O'.
Environment