
EN Tokenizer Error: 'shell' tokenized as 'she', 'll', etc. #847

Closed
rappdw opened this issue Feb 18, 2017 · 4 comments
Labels
lang / en English language data and models

Comments

@rappdw
Contributor

rappdw commented Feb 18, 2017

The 1.6.0 tokenizer is incorrectly tokenizing words that have a 'she' prefix.

Examples:
'This sea shell is unique', tokenizes 'shell' as 'she', 'll'
'The shovel is in the shed', tokenizes 'shed' as 'she', 'd'
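The reported splits behave like a special-case table lookup gone wrong. A plain-Python sketch of that mechanism (the table below is hypothetical, not spaCy's actual exception data) reproduces the symptom:

```python
# Hypothetical special-case table mimicking the buggy 1.6.0 behaviour:
# apostrophe-free contraction forms ("shell" for "she'll") were registered
# as exceptions, so real words matching those forms get split.
SPECIAL_CASES = {
    "shell": ["she", "ll"],
    "shed": ["she", "d"],
}

def tokenize(text):
    """Whitespace-split, then expand any word found in the exception table."""
    tokens = []
    for word in text.split():
        tokens.extend(SPECIAL_CASES.get(word, [word]))
    return tokens

print(tokenize("This sea shell is unique"))
# ['This', 'sea', 'she', 'll', 'is', 'unique']
```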

Your Environment

  • Operating System: OSX & Linux
  • Python Version Used: 3.6.0
  • spaCy Version Used: 1.6.0
  • Environment Information:
@ines
Member

ines commented Feb 18, 2017

Thanks for the report! The shell error was already fixed in #775, and I just added another test case for shed.

If you install from master, it should be fixed now – we'll also make a bug fix release soon that will include those changes.

@ines ines closed this as completed Feb 18, 2017
@ines ines added lang / en and performance labels Feb 18, 2017
@rappdw
Contributor Author

rappdw commented Feb 18, 2017

Thanks.

Looking at en.tokenizer_exceptions.EXCLUDE_EXC, there is perhaps one other case that should be added: id, as in "The id and the ego...". Currently 'id' tokenizes as 'i', 'd', which is wrong in this case.
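A rough sketch of how an exclusion list like EXCLUDE_EXC can work — this is a guess at the mechanism for illustration, not spaCy's actual generation code: apostrophe-free contraction variants are generated automatically, unless the bare form collides with a real English word listed as an exclusion.

```python
# Hypothetical contraction data; pieces keep the apostrophe on the suffix.
CONTRACTIONS = {
    "she'll": ["she", "'ll"],
    "she'd": ["she", "'d"],
    "i'd": ["i", "'d"],
}

# Real words whose spelling matches an apostrophe-free contraction:
# don't generate exceptions for these.
EXCLUDE_EXC = {"shell", "shed", "id"}

def build_exceptions(contractions, exclude):
    """Register each contraction, plus its apostrophe-free variant
    unless that variant is a real word in the exclusion list."""
    exceptions = dict(contractions)
    for form, pieces in contractions.items():
        bare = form.replace("'", "")
        if bare not in exclude:
            exceptions[bare] = [p.replace("'", "") for p in pieces]
    return exceptions

exc = build_exceptions(CONTRACTIONS, EXCLUDE_EXC)
# "shell", "shed" and "id" stay whole; "she'll" etc. are still split
```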

@ines
Member

ines commented Feb 18, 2017

Thanks, good point! Thinking about it, this is actually a tricky one... In general, we prefer to base the default tokenizer exceptions on what's most common. If we come across id in most genres in English, it's more likely to mean "i'd" than the Freudian concept of "id".

To deal with this problem, we've been thinking about adding a new method to the Tokenizer that lets you remove and replace exceptions. This would let you customise the tokenizer for the genre of text you're working with, from simple exception rules like this one to more complex regular expressions for punctuation. (At the moment, the only way to do this is to create a custom Tokenizer instance and override the defaults, which is not ideal.)
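A sketch of what such a remove/replace interface could look like, using a plain dict as a stand-in for the tokenizer's exception table — the method names here are hypothetical, not an existing spaCy API:

```python
class ExceptionTable:
    """Stand-in for a tokenizer's special-case table."""

    def __init__(self, exceptions):
        self._exc = dict(exceptions)

    def remove_exception(self, form):
        # Hypothetical: drop a default exception entirely.
        self._exc.pop(form, None)

    def replace_exception(self, form, pieces):
        # Hypothetical: override a default exception with custom pieces.
        self._exc[form] = list(pieces)

    def tokenize(self, word):
        return self._exc.get(word, [word])

table = ExceptionTable({"id": ["i", "d"]})
table.remove_exception("id")  # keep the Freudian "id" whole for this genre
table.tokenize("id")          # -> ["id"]
```

The point of such an API is that genre-specific customisation happens on the default table, rather than requiring a whole custom Tokenizer instance.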

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018