Mismatch between token rank and vocab vector find. #2871

Closed
jason-stein opened this issue Oct 22, 2018 · 3 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@jason-stein

I have found a discrepancy between a token's rank and the row returned by tokenizer.vocab.vectors.find() for the word "SUFFIX" (all caps). Furthermore, the index returned by .rank is causing a range error in a TensorFlow model trained on spaCy word vectors.

How to reproduce the behaviour

>>> import spacy
>>> nlp = spacy.load('en_core_web_lg')
>>> nlp('SUFFIX')[0].rank
684829
>>> nlp.tokenizer.vocab.vectors.find(key='SUFFIX')
-1
>>> nlp('suffix')[0].rank
31698
>>> nlp.tokenizer.vocab.vectors.find(key='suffix')
31698
>>> spacy.__version__
'2.0.12'
>>> import en_core_web_lg
>>> en_core_web_lg.__version__
'2.0.0'

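The bug shows in the third and fourth statements: for the uppercase "SUFFIX", .rank returns 684829 while vectors.find() returns -1. The lowercase version matches, as expected. This was discovered when a TensorFlow model threw the following error, which shows that 684829 lies outside the valid index range: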

StatusCode.INVALID_ARGUMENT, indices[0,6] = 684829 is not in [0, 684824)
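
Until the rank itself is fixed, one possible workaround is to guard the indices before they reach the embedding lookup. The following is only a minimal sketch in plain Python: `clamp_to_table` and `oov_row` are hypothetical names that are not part of spaCy or TensorFlow, and sending out-of-range words to row 0 is an assumption about how a model might treat unknown words.

```python
from typing import Iterable, List


def clamp_to_table(indices: Iterable[int], n_rows: int, oov_row: int = 0) -> List[int]:
    """Replace every index outside [0, n_rows) with oov_row (hypothetical helper)."""
    if not 0 <= oov_row < n_rows:
        raise ValueError("oov_row must itself be a valid row index")
    return [i if 0 <= i < n_rows else oov_row for i in indices]


# 684824 is the upper bound from the error message above;
# 31698 and 684829 are the ranks reported for "suffix" and "SUFFIX".
safe = clamp_to_table([31698, 684829], n_rows=684824)
print(safe)  # [31698, 0]
```

This only hides the symptom: the affected word loses its own vector, so it does not replace a fix for the underlying rank mismatch.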

Your Environment

  • Operating System: OSX 10.13.6
  • Python Version Used: 3.6.6
  • spaCy Version Used: 2.0.12
  • spaCy Model Version: en_core_web_lg 2.0.0
@jason-stein
Author

This appears to affect strings that are also spaCy attribute names: similar discrepancies exist for ORTH and PREFIX.
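
A quick way to scan a list of words for this kind of discrepancy could look like the sketch below. `find_mismatches` is a hypothetical helper, and the two lookups are passed in as callables so that the example runs without a model; here they are stand-ins built from the values reported in this thread.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    words: Iterable[str],
    rank_of: Callable[[str], int],
    row_of: Callable[[str], int],
) -> List[Tuple[str, int, int]]:
    """Return (word, rank, row) for every word whose two lookups disagree."""
    mismatches = []
    for word in words:
        rank, row = rank_of(word), row_of(word)
        if rank != row:
            mismatches.append((word, rank, row))
    return mismatches


# Stand-in lookups using the numbers from the report.
# With a loaded pipeline `nlp`, the calls from the report could be passed instead:
#   rank_of = lambda w: nlp(w)[0].rank
#   row_of = lambda w: nlp.tokenizer.vocab.vectors.find(key=w)
reported_ranks = {"SUFFIX": 684829, "suffix": 31698}
reported_rows = {"SUFFIX": -1, "suffix": 31698}

print(find_mismatches(["SUFFIX", "suffix"], reported_ranks.get, reported_rows.get))
# [('SUFFIX', 684829, -1)]
```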

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Oct 28, 2018
@honnibal
Member

Thanks, I think I understand the problem here.

@lock

lock bot commented Jan 9, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 9, 2019