Unexpected behaviour of `tokenizer.add_special_case` method #2728

CermakM · 2018-09-03T11:45:40Z

How to reproduce the behaviour

import spacy

nlp = spacy.load('en_core_web_md')  # load model with vectors
sample = "RELEASE <RELEASE>"

I want to add a special case to preserve the custom tag as a single token.

Problem no.1: Documentation example raises `TypeError`

from spacy.attrs import ORTH, LEMMA

case = [{"don't": [{ORTH: "do"}, {ORTH: "n't", LEMMA: "not"}]}]
tokenizer.add_special_case(case)

raises:

TypeError: add_special_case() takes exactly 2 positional arguments (1 given)
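For comparison, here is a minimal sketch of the two-argument call form that the error message points to: the string to match first, then the list of token attribute dicts. It assumes a blank English pipeline (`spacy.blank("en")`) rather than `en_core_web_md`, and sets only `ORTH`, since newer spaCy versions reject other attributes such as `LEMMA` in special cases.

```python
import spacy
from spacy.attrs import ORTH

# Assumption: a blank English pipeline (tokenizer only, no model download).
nlp = spacy.blank("en")

# First argument: the exact string to match.
# Second argument: the list of token attribute dicts it should be split into.
nlp.tokenizer.add_special_case("<RELEASE>", [{ORTH: "<RELEASE>"}])

doc = nlp("RELEASE <RELEASE>")
tokens = [token.text for token in doc]
print(tokens)
```

The `ORTH` values of the substrings have to concatenate back to the matched string, which is trivially true here because the special case keeps the tag whole.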

Problem no.2: Token is missing from entity rendering

NOTE: Working in Jupyter notebook

case = [{ORTH: '<RELEASE>', LEMMA: 'RELEASE'}]
nlp.tokenizer.add_special_case(u'<RELEASE>', case)

doc = nlp(sample)  # tokenization successful, '<RELEASE>' is a single token

spacy.displacy.render(doc, style='ent', options={'compact': True}, jupyter=True)
# outputs: "Release "

Note that the <RELEASE> part is completely missing.

Problem no.3: Explicitly specifying custom ENT_TYPE and ENT_IOB causes jupyter kernel to crash

I haven't found any documentation regarding this, so I just tried passing the following to the tokenizer's add_special_case method.

from spacy.attrs import ENT_TYPE, ENT_IOB  # needed in addition to ORTH and LEMMA

case = [{ORTH: '<RELEASE>', LEMMA: 'RELEASE', ENT_TYPE: 'CUSTOM', ENT_IOB: 'B'}]
nlp.tokenizer.add_special_case(u'<RELEASE>', case)

doc = nlp(sample)

After accessing doc.ents, or a token's ent_type_ or ent_iob_, the Jupyter kernel crashes.

Problem no.4: Explicitly assign entity to special case

case = [{ORTH: '<RELEASE>', LEMMA: 'RELEASE', ENT_TYPE: 'CARDINAL', ENT_IOB: 'B'}]
nlp.tokenizer.add_special_case(u'<RELEASE>', case)

doc = nlp(sample)

This has no effect: the token's ent_type_ is still '' and its ent_iob_ is 'O'.

Environment

Operating System: Fedora 28
Python Version Used: 3.6.5
spaCy Version Used: 2.0.12
Environment Information: Jupyter notebook


ines · 2019-02-21T13:30:19Z

Sorry for only getting to this now. I fixed the first example in the documentation and a bug that was actually related to displaCy and how the HTML was escaped or not escaped.
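To illustrate why unescaped text such as `<RELEASE>` can vanish from rendered HTML, here is a small sketch using only the standard library. It is not displaCy's actual code; the `<mark>` wrapper and the variable names are assumptions made for the example.

```python
from html import escape

token_text = "<RELEASE>"

# Unescaped: a browser parses "<RELEASE>" as an unknown HTML tag,
# so nothing visible is shown for it.
raw_markup = "<mark>{}</mark>".format(token_text)

# Escaped: the angle brackets become entities and render as literal text.
safe_markup = "<mark>{}</mark>".format(escape(token_text))

print(raw_markup)   # <mark><RELEASE></mark>
print(safe_markup)  # <mark>&lt;RELEASE&gt;</mark>
```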

And yes, setting entity types via the tokenizer exceptions isn't supported because this can easily lead to confusing results and only really works for special case rules where the token text contains the full entity. For example, I on its own can't exist

> After issuing doc.ents or trying to access token ent_type_ or ent_iob_, jupyter kernel crashes.

spaCy should fail more gracefully now, but the underlying problem here is likely an internal mismatch of attributes, because spaCy doesn't expect the entity tags to be overwritten in the tokenizer.

So you probably want to do this kind of thing in a separate step using the rule-based Matcher. You could even use your special case rules as entries in the match patterns, because they follow the same style.
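A rough sketch of that separate step, under a few assumptions: it uses the spaCy v3 form of `Matcher.add` (v2 took `matcher.add("CUSTOM", None, pattern)` instead), a blank English pipeline, and the `CUSTOM` label from the report above. The tokenizer rule only keeps the tag as one token; the entity is assigned afterwards from the match.

```python
import spacy
from spacy.attrs import ORTH
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: blank English pipeline, so doc.ents starts out empty.
nlp = spacy.blank("en")

# The tokenizer special case only controls tokenization.
nlp.tokenizer.add_special_case("<RELEASE>", [{ORTH: "<RELEASE>"}])

# The match pattern mirrors the special case rule: one token with that exact text.
matcher = Matcher(nlp.vocab)
matcher.add("CUSTOM", [[{"ORTH": "<RELEASE>"}]])  # spaCy v3 signature

doc = nlp("RELEASE <RELEASE>")

# Turn each match into an entity span and add it to the existing entities.
spans = [Span(doc, start, end, label="CUSTOM") for _, start, end in matcher(doc)]
doc.ents = list(doc.ents) + spans

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, a matched span could overlap an entity the model already predicted, and assigning overlapping spans to `doc.ents` raises an error, so real code would need to filter or replace those first.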

lock · 2019-03-23T14:21:30Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the docs (Documentation and website) and feat / tokenizer (Feature: Tokenizer) labels Sep 5, 2018

ines added a commit that referenced this issue Feb 21, 2019

Fix docs example (see #2728)

250e88e

ines added a commit that referenced this issue Feb 21, 2019

Fix escaping of HTML in displacy ENT (closes #2728)

80bdcb9

ines closed this as completed Feb 21, 2019

lock bot locked as resolved and limited conversation to collaborators Mar 23, 2019

ljvmiranda921 added the feat / visualizers (Feature: Built-in displaCy and other visualizers) label Nov 16, 2021
