Mismatch between token rank and vocab vector find. #2871

jason-stein · 2018-10-22T14:59:22Z

I have found a discrepancy between a token's rank and the result of looking the same word up with tokenizer.vocab.vectors.find(), for the word "SUFFIX" (all caps). Furthermore, the index returned by .rank causes a range error in a TensorFlow model trained on spaCy word vectors.

How to reproduce the behaviour

>>> import spacy
>>> nlp = spacy.load('en_core_web_lg')
>>> nlp('SUFFIX')[0].rank
684829
>>> nlp.tokenizer.vocab.vectors.find(key='SUFFIX')
-1
>>> nlp('suffix')[0].rank
31698
>>> nlp.tokenizer.vocab.vectors.find(key='suffix')
31698
>>> spacy.__version__
'2.0.12'
>>> import en_core_web_lg
>>> en_core_web_lg.__version__
'2.0.0'
>>>

The bug is in the two "SUFFIX" lookups: .rank returns 684829, while vectors.find() returns -1. The lowercase version matches, as expected. I discovered this when a TensorFlow model threw the following error, which indicates that the rank is not a valid index:

StatusCode.INVALID_ARGUMENT, indices[0,6] = 684829 is not in [0, 684824)
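
Until the underlying bug is fixed, one possible workaround is to validate each rank before it reaches the embedding lookup. The following is only a minimal sketch: the helper name safe_embedding_index and the choice of row 0 as the out-of-vocabulary fallback are assumptions for illustration, not part of spaCy or TensorFlow. The bound 684824 is taken from the error message above; in practice it would be the number of rows in your own embedding matrix.

```python
def safe_embedding_index(rank, n_rows, oov_index=0):
    """Return `rank` if it addresses a valid embedding row, else `oov_index`.

    Hypothetical helper: `oov_index=0` is an assumed fallback row.
    """
    if 0 <= rank < n_rows:
        return rank
    return oov_index


# Upper bound reported in the TensorFlow error: valid indices are [0, 684824).
n_rows = 684824

print(safe_embedding_index(31698, n_rows))   # rank of "suffix": in range, kept
print(safe_embedding_index(684829, n_rows))  # rank of "SUFFIX": out of range, replaced
```

This only hides the symptom in the downstream model; the mismatch between .rank and vectors.find() is still present.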

Your Environment

Operating System: OSX 10.13.6
Python Version Used: 3.6.6
spaCy Version Used: 2.0.12
spaCy Model Version: en_core_web_lg 2.0.0

jason-stein · 2018-10-22T15:02:40Z

This appears to be an issue around certain spaCy keywords. Similar discrepancies exist for ORTH and PREFIX.
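
To check a list of words for the same discrepancy, a comparison along these lines could be used. This is a sketch under stated assumptions: find_rank_mismatches is a hypothetical helper, and its two lookup arguments stand in for the .rank attribute and the vectors.find(key=...) call shown in the report. The sample values are copied from the session above; no values for ORTH or PREFIX were reported, so they are not included.

```python
def find_rank_mismatches(words, rank_of, row_of):
    """Return (word, rank, row) for every word whose rank differs from its vector row.

    `rank_of` and `row_of` are callables mapping a word to an integer.
    """
    mismatches = []
    for word in words:
        rank, row = rank_of(word), row_of(word)
        if rank != row:
            mismatches.append((word, rank, row))
    return mismatches


# Values copied from the session in the original report.
reported_ranks = {"SUFFIX": 684829, "suffix": 31698}
reported_rows = {"SUFFIX": -1, "suffix": 31698}

mismatches = find_rank_mismatches(
    ["SUFFIX", "suffix"], reported_ranks.get, reported_rows.get
)
print(mismatches)  # [('SUFFIX', 684829, -1)]
```

With a loaded model, the two dictionaries would be replaced by lookups against the model itself.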

honnibal · 2018-10-28T14:38:17Z

Thanks, I think I understand the problem here.

lock · 2019-01-09T15:12:37Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug (Bugs and behaviour differing from documentation) label Oct 28, 2018

honnibal added a commit that referenced this issue Dec 10, 2018

Add test for issue #2871 -- vectors for reserved words

cc1ea03

honnibal added a commit that referenced this issue Dec 10, 2018

Fix vectors for reserved words. Closes #2871

90aec6d

honnibal closed this as completed Dec 10, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 9, 2019
