Unexpected behaviour of tokenizer.add_special_case method #2728

Closed
CermakM opened this issue Sep 3, 2018 · 2 comments
Labels
docs Documentation and website feat / tokenizer Feature: Tokenizer feat / visualizers Feature: Built-in displaCy and other visualizers

Comments

CermakM commented Sep 3, 2018

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_md')  # load model with vectors
sample = "RELEASE <RELEASE>"

I want to add a special case that preserves the custom tag as a single token.


Problem no.1: Documentation example raises TypeError

from spacy.attrs import ORTH, LEMMA

case = [{"don't": [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]}]
tokenizer.add_special_case(case)

raises:

TypeError: add_special_case() takes exactly 2 positional arguments (1 given)
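
For comparison, the TypeError is consistent with the method expecting the string and its list of token attributes as two separate arguments. A minimal sketch of that call shape follows. It assumes a blank English pipeline (`spacy.blank("en")`, so no model download is needed) and uses only ORTH and NORM, a choice made here so the snippet does not depend on which other attributes a given spaCy version accepts in special cases:

```python
import spacy
from spacy.attrs import ORTH, NORM

# Assumption: a blank English pipeline is enough to exercise the tokenizer.
nlp = spacy.blank("en")

# First argument: the exact string to match.
# Second argument: one dict per output token; the ORTH values must
# concatenate back to the original string.
special_case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
nlp.tokenizer.add_special_case("don't", special_case)

# Same call shape for the custom tag from this report.
nlp.tokenizer.add_special_case("<RELEASE>", [{ORTH: "<RELEASE>"}])

dont_tokens = [t.text for t in nlp("don't")]
tokens = [t.text for t in nlp("RELEASE <RELEASE>")]
print(dont_tokens)
print(tokens)
```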

Problem no.2: Token is missing from entity rendering

NOTE: Working in Jupyter notebook

case = [{ORTH: '<RELEASE>', LEMMA: 'RELEASE'}]
nlp.tokenizer.add_special_case(u'<RELEASE>', case)

doc = nlp(sample)  # tokenization successful, '<RELEASE>' is a single token

spacy.displacy.render(doc, style='ent', options={'compact': True}, jupyter=True)
# outputs: "Release "

Note that the <RELEASE> part is completely missing.


Problem no.3: Explicitly specifying custom ENT_TYPE and ENT_IOB causes jupyter kernel to crash

I haven't found any documentation regarding this, so I just tried passing the following to the tokenizer's add_special_case method.

from spacy.attrs import ENT_TYPE, ENT_IOB

case = [{ORTH: '<RELEASE>', LEMMA: 'RELEASE', ENT_TYPE: 'CUSTOM', ENT_IOB: 'B'}]
nlp.tokenizer.add_special_case(u'<RELEASE>', case)

doc = nlp(sample)

After evaluating doc.ents or trying to access a token's ent_type_ or ent_iob_, the Jupyter kernel crashes.


Problem no.4: Explicitly assign entity to special case

case = [{ORTH: '<RELEASE>', LEMMA: 'RELEASE', ENT_TYPE: 'CARDINAL', ENT_IOB: 'B'}]
nlp.tokenizer.add_special_case(u'<RELEASE>', case)

doc = nlp(sample)

This has no effect: the token's ent_type_ is still '' and its ent_iob_ is 'O'.


Environment

  • Operating System: Fedora 28
  • Python Version Used: 3.6.5
  • spaCy Version Used: 2.0.12
  • Environment Information: Jupyter notebook
@ines ines added docs Documentation and website feat / tokenizer Feature: Tokenizer labels Sep 5, 2018

ines commented Feb 21, 2019

Sorry for only getting to this now. I fixed the first example in the documentation and a bug that was actually related to displaCy and how the HTML was escaped or not escaped.
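
To illustrate the escaping point: if a token such as `<RELEASE>` is written into HTML without escaping, a browser reads it as an unknown tag and displays nothing for it, which would match the missing text in the rendering. The sketch below uses only the standard library's `html.escape`; the `<mark>` wrapper is a hypothetical stand-in and this is not displaCy's actual rendering code:

```python
import html

token_text = "<RELEASE>"

# Unescaped: the browser would parse <RELEASE> as an (unknown) element.
raw_markup = "<mark>{}</mark>".format(token_text)

# Escaped: the angle brackets become entities and show up as literal text.
escaped_markup = "<mark>{}</mark>".format(html.escape(token_text))

print(raw_markup)      # <mark><RELEASE></mark>
print(escaped_markup)  # <mark>&lt;RELEASE&gt;</mark>
```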

And yes, setting entity types via the tokenizer exceptions isn't supported because this can easily lead to confusing results and only really works for special case rules where the token text contains the full entity. For example, I on its own can't exist

After issuing doc.ents or trying to access token ent_type_ or ent_iob_, jupyter kernel crashes.

spaCy should fail more gracefully now, but the underlying problem here is likely an internal mismatch of attributes, because spaCy doesn't expect the entity tags to be overwritten in the tokenizer.

So you probably want to do this kind of thing in a separate step using the rule-based Matcher. You could even use your special case rules as entries in the match patterns, because they follow the same style.
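
A rough sketch of that separate step, under some assumptions: it targets the spaCy v3 API (`matcher.add` taking a list of patterns, which differs from the 2.0.x signature used when this issue was opened), it uses a blank English pipeline so no model download is needed, and the `CUSTOM` label and variable names are placeholders chosen here for illustration:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: spaCy v3; a blank pipeline keeps the example self-contained.
nlp = spacy.blank("en")

# Tokenization only: keep the tag together as one token.
nlp.tokenizer.add_special_case("<RELEASE>", [{"ORTH": "<RELEASE>"}])

# Entity assignment happens afterwards, via a token pattern written in
# the same attribute-dict style as the special case.
matcher = Matcher(nlp.vocab)
matcher.add("CUSTOM", [[{"ORTH": "<RELEASE>"}]])

doc = nlp("RELEASE <RELEASE>")

# Turn each match into a labelled span and add it to the doc's entities.
spans = [Span(doc, start, end, label="CUSTOM") for _, start, end in matcher(doc)]
doc.ents = list(doc.ents) + spans

ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With a pretrained pipeline, the matched spans could overlap entities the statistical model has already set, so in practice they may need filtering before being assigned to `doc.ents`.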

ines added a commit that referenced this issue Feb 21, 2019
@ines ines closed this as completed Feb 21, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Mar 23, 2019
@ljvmiranda921 ljvmiranda921 added the feat / visualizers Feature: Built-in displaCy and other visualizers label Nov 16, 2021