
Tokenizer.add_special_case raises KeyError #656

Closed
soldni opened this issue Nov 23, 2016 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


soldni commented Nov 23, 2016

The usage example provided in the documentation for Tokenizer.add_special_case raises a KeyError.

Steps to reproduce:

import spacy
from spacy.symbols import ORTH, LEMMA, POS

nlp = spacy.load('en')

nlp.tokenizer.add_special_case(u'gimme',
    [
        {
            ORTH: u'gim',
            LEMMA: u'give',
            POS: u'VERB'},
        {
            ORTH: u'me' }])

# Traceback (most recent call last):
#   File "test.py", line 13, in <module>
#     ORTH: u'me' }])
#   File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#  File "spacy/vocab.pyx", line 340, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7879)
# KeyError: 'F'

Environment

  • Operating System: Ubuntu 16.04 / macOS 10.12.1
  • Python Version Used: CPython 3.5.2
  • spaCy Version Used: 1.2.0
  • Environment Information: n/a

soldni commented Nov 23, 2016

A bit of follow-up: I was going through the definition of spacy.vocab.Vocab.make_fused_token, and it seems that the code expects ORTH to equal 'F', POS to equal 'pos', and LEMMA to equal 'L'; however, ORTH equals 65, POS equals 74, and LEMMA equals 73.

I am not sure whether the values expected by make_fused_token are intentionally different from those defined in spacy.symbols.
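
For reference, a quick check of the symbol values (a minimal sketch; the printed numbers are the ones I see on spaCy 1.2.0):

# Minimal sketch: inspect the integer IDs behind the spacy.symbols constants.
from spacy.symbols import ORTH, LEMMA, POS

print(ORTH, LEMMA, POS)
# 65 73 74 on spaCy 1.2.0, i.e. integers, whereas make_fused_token
# looks up the string keys 'F', 'L', 'pos'.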

EDIT: Even when replacing the keys of the token_attrs argument as described above, I still encounter an error:

import spacy

nlp = spacy.load('en')

nlp.tokenizer.add_special_case('gimme',
    [
        {
            'F': 'gim',
            'L': 'give',
            'pos': 'VERB'},
        {
            'F': 'me' }])

# Traceback (most recent call last):
#  File "test.py", line 13, in <module>
#    'F': 'me' }])
#  File "spacy/tokenizer.pyx", line 377, in spacy.tokenizer.Tokenizer.add_special_case (spacy/tokenizer.cpp:8460)
#  File "spacy/vocab.pyx", line 342, in spacy.vocab.Vocab.make_fused_token (spacy/vocab.cpp:7907)
#  File "spacy/morphology.pyx", line 39, in spacy.morphology.Morphology.assign_tag (spacy/morphology.cpp:3919)
# KeyError: 97

honnibal (Member) commented

Thanks for this.

The docs have gotten ahead of the library here — the current/old behaviour is pretty inconsistent, so I wrote up the intended usage, but haven't had time to fix it yet. Will definitely have this resolved in the next release, which should be up this week.

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Nov 23, 2016
@soldni soldni closed this as completed Nov 23, 2016
@soldni soldni reopened this Nov 23, 2016

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018