
EN Tokenizer Error: 'shell' tokenized as 'she', 'll', etc. #847

Closed
rappdw opened this issue Feb 18, 2017 · 4 comments
Labels
lang / en English language data and models

Comments

@rappdw
Contributor

rappdw commented Feb 18, 2017

The 1.6.0 tokenizer is incorrectly tokenizing words that have a 'she' prefix.

Examples:
'This sea shell is unique', tokenizes 'shell' as 'she', 'll'
'The shovel is in the shed', tokenizes 'shed' as 'she', 'd'
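The reported splits behave like a special-case table lookup gone wrong. A plain-Python sketch of that mechanism (the table below is hypothetical, not spaCy's actual exception data) reproduces the symptom:

```python
# Hypothetical special-case table mimicking the buggy 1.6.0 behaviour:
# apostrophe-free contraction forms ("shell" for "she'll") were registered
# as exceptions, so real words matching those forms get split.
SPECIAL_CASES = {
    "shell": ["she", "ll"],
    "shed": ["she", "d"],
}

def tokenize(text):
    """Whitespace-split, then expand any word found in the exception table."""
    tokens = []
    for word in text.split():
        tokens.extend(SPECIAL_CASES.get(word, [word]))
    return tokens

print(tokenize("This sea shell is unique"))
# ['This', 'sea', 'she', 'll', 'is', 'unique']
```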

Your Environment

  • Operating System: OSX & Linux
  • Python Version Used: 3.6.0
  • spaCy Version Used: 1.6.0
  • Environment Information:
@ines
Member

ines commented Feb 18, 2017

Thanks for the report! The shell error was already fixed in #775, and I just added another test case for shed.

If you install from master, it should be fixed now – we'll also make a bug fix release soon that will include those changes.

@ines ines closed this as completed Feb 18, 2017
@ines ines added lang / en and performance labels Feb 18, 2017
@rappdw
Contributor Author

rappdw commented Feb 18, 2017

Thanks.

Looking at en.tokenizer_exceptions.EXCLUDE_EXC, there is perhaps one other case that should be added: id, as in "The id and the ego...". Currently 'id' tokenizes as 'i', 'd', which is wrong in this case.
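A rough sketch of how an exclusion list like EXCLUDE_EXC can work — this is a guess at the mechanism for illustration, not spaCy's actual generation code: apostrophe-free contraction variants are generated automatically, unless the bare form collides with a real English word listed as an exclusion.

```python
# Hypothetical contraction data; pieces keep the apostrophe on the suffix.
CONTRACTIONS = {
    "she'll": ["she", "'ll"],
    "she'd": ["she", "'d"],
    "i'd": ["i", "'d"],
}

# Real words whose spelling matches an apostrophe-free contraction:
# don't generate exceptions for these.
EXCLUDE_EXC = {"shell", "shed", "id"}

def build_exceptions(contractions, exclude):
    """Register each contraction, plus its apostrophe-free variant
    unless that variant is a real word in the exclusion list."""
    exceptions = dict(contractions)
    for form, pieces in contractions.items():
        bare = form.replace("'", "")
        if bare not in exclude:
            exceptions[bare] = [p.replace("'", "") for p in pieces]
    return exceptions

exc = build_exceptions(CONTRACTIONS, EXCLUDE_EXC)
# "shell", "shed" and "id" stay whole; "she'll" etc. are still split
```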

@ines
Member

ines commented Feb 18, 2017

Thanks, good point! Thinking about it, this is actually a tricky one... In general, we prefer to base the default tokenizer exceptions on what's most common. If we come across id in most genres in English, it's more likely to mean "i'd" than the Freudian concept of "id".

To deal with this problem, we've been thinking about adding a new method to the Tokenizer that lets you remove and replace exceptions. This would let you customise the tokenizer for the genre of text you're working with, from simple exception rules like this one to more complex regular expressions for punctuation. (At the moment, the only way to do this is to create a custom Tokenizer instance and override the defaults, which is not ideal.)
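A sketch of what such a remove/replace interface could look like, using a plain dict as a stand-in for the tokenizer's exception table — the method names here are hypothetical, not an existing spaCy API:

```python
class ExceptionTable:
    """Stand-in for a tokenizer's special-case table."""

    def __init__(self, exceptions):
        self._exc = dict(exceptions)

    def remove_exception(self, form):
        # Hypothetical: drop a default exception entirely.
        self._exc.pop(form, None)

    def replace_exception(self, form, pieces):
        # Hypothetical: override a default exception with custom pieces.
        self._exc[form] = list(pieces)

    def tokenize(self, word):
        return self._exc.get(word, [word])

table = ExceptionTable({"id": ["i", "d"]})
table.remove_exception("id")  # keep the Freudian "id" whole for this genre
table.tokenize("id")          # -> ["id"]
```

The point of such an API is that genre-specific customisation happens on the default table, rather than requiring a whole custom Tokenizer instance.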

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018