
Tokenizer.add_special_case raises KeyError #656

Closed
soldni opened this issue Nov 23, 2016 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


soldni commented Nov 23, 2016

The usage example provided in the documentation for Tokenizer.add_special_case raises a KeyError.

Steps to reproduce:

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')

nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me' }])

# Traceback (most recent call last):
#   File "test.py", line 13, in <module>
#     ORTH: u'me' }])
#   File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#  File "spacy/vocab.pyx", line 340, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)
# KeyError: 'F'

Environment

  • Operating System: Ubuntu 16.04 / macOS 10.12.1
  • Python Version Used: CPython 3.5.2
  • spaCy Version Used: 1.2.0
  • Environment Information: n/a

soldni commented Nov 23, 2016

A bit of follow-up: I was going through the definition of spacy.vocab.Vocab.make_fused_token, and it seems that the code expects ORTH to equal 'F', POS to equal 'pos', and LEMMA to equal 'L'; however, ORTH equals 65, POS equals 74, and LEMMA equals 73.

I am not sure whether the values expected by make_fused_token are intentionally different from those defined in spacy.symbols.
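
For reference, a quick check of the symbol values (a minimal sketch; the printed numbers are the ones I see on spaCy 1.2.0):

# Minimal sketch: inspect the integer IDs behind the spacy.symbols constants.
from spacy.symbols import ORTH, LEMMA, POS

print(ORTH, LEMMA, POS)
# 65 73 74 on spaCy 1.2.0, i.e. integers, whereas make_fused_token
# looks up the string keys 'F', 'L', 'pos'.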

EDIT: Even when replacing the keys of the token_attrs argument as described above, I still encounter an error:

import spacy

nlp = spacy.load('en')

nlp.tokenizer.add_special_case('gimme',
    [
        {
            'F': 'gim',
            'L': 'give',
            'pos': 'VERB'},
        {
            'F': 'me' }])

# Traceback (most recent call last):
#  File "test.py", line 13, in <module>
#    'F': 'me' }])
#  File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#  File "spacy/vocab.pyx", line 342, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)
#  File "spacy/morphology.pyx", line 39, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)
# KeyError: 97

honnibal (Member) commented

Thanks for this.

The docs have gotten ahead of the library here — the current/old behaviour is pretty inconsistent, so I wrote up the intended usage, but haven't had time to fix it yet. Will definitely have this resolved in the next release, which should be up this week.

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Nov 23, 2016
@soldni soldni closed this as completed Nov 23, 2016
@soldni soldni reopened this Nov 23, 2016

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018