
tokenizer.add_special_case not working when special token not followed by whitespace #1061

Closed
christian-storm opened this issue May 16, 2017 · 6 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@christian-storm

When a special-case token is followed by a non-whitespace character, the special case isn't recognized. I wrote a quick test to demonstrate:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')  # nlp was not defined in the original snippet
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
#print([w.text for w in nlp(text)])
assert [w.text for w in nlp(text)] == ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']

The last '...gimme...?' is broken up into '...', 'gimme', '...', '?' by the presence of the '!'.
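For intuition, this failure mode is consistent with an exact-match lookup over whitespace-delimited chunks. A simplified sketch (toy code, not spaCy's actual implementation):

```python
# Hypothetical sketch of why the special case is missed: the tokenizer first
# splits the text on whitespace, then checks each chunk against the
# special-case table by exact match. The chunk '...gimme...?!' is not an
# exact key, so the rule never fires and the chunk falls through to the
# normal prefix/suffix/infix rules instead.
special_cases = {'...gimme...?': ['...gimme...?']}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:           # exact-match lookup only
            tokens.extend(special_cases[chunk])
        else:
            tokens.append(chunk)             # the real tokenizer applies affix rules here
    return tokens

print(toy_tokenize('...gimme...? that ...gimme...?!'))
# ['...gimme...?', 'that', '...gimme...?!']  — the last chunk misses the rule
```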

Your Environment

  • spaCy version: 1.8.2
  • Platform: Darwin-16.5.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en_depent_web_md
@slavaGanzin

slavaGanzin commented May 16, 2017

This bug only reproduces when you have already run the pipeline:

import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
print([w.text for w in nlp(text)])
#['...', 'gimme', '...', '?', 'that', '...', 'gimme', '...', '?', 'or', 'else', '...', 'gimme', '...', '?', '!']
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
#['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...', 'gimme', '...', '?', '!']

But:

import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
#['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']

Info about spaCy

  • spaCy version: 1.8.2
  • Platform: Linux-4.10.13-1-ARCH-x86_64-with-arch
  • Python version: 3.6.1
  • Installed models: en, en_depent_web_md

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label May 16, 2017
@honnibal
Member

Both really helpful, thanks!

@christian-storm
Author

It seems to be even a little more nuanced. Comments are inline with the test case:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_.'
print([w.text for w in nlp(text)])
# As expected it treats prefix and suffix symbols as tokens
#['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']

nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token isn't followed by whitespace
#['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_MATH_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']

# Reset pipeline
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# As slavaGanzin points out, adding the special case before running the pipeline gives the expected behavior, except when the token is followed (or preceded) by a period
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'but', 'not', '_', 'MATH_.']

@honnibal
Member

Many thanks for the great report --- this was a very long-standing cache invalidation bug. Comment from the patch:

Add flush_cache method to tokenizer, to fix #1061

The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I think this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
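The failure mode described in the patch comment can be sketched with a toy caching tokenizer. This is illustrative only; the `flush_cache` name mirrors the patch, but the code is not spaCy's:

```python
# Toy illustration of the stale-cache bug: chunk tokenizations cached before
# a rule change keep being returned unless the cache is flushed when the
# rules are modified. The fix is to invalidate the cache in
# add_special_case, as the patch does with flush_cache.
class CachingTokenizer:
    def __init__(self):
        self.special_cases = {}
        self._cache = {}

    def add_special_case(self, chunk, tokens):
        self.special_cases[chunk] = tokens
        self.flush_cache()                   # the fix: invalidate on rule change

    def flush_cache(self):
        self._cache.clear()

    def tokenize(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:     # cache miss: compute and store
                self._cache[chunk] = self.special_cases.get(chunk, [chunk])
            tokens.extend(self._cache[chunk])
        return tokens
```

Without the `flush_cache()` call in `add_special_case`, a chunk tokenized before the rule was added would keep its stale cached result, which is exactly the order-dependent behavior slavaGanzin observed.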

@christian-storm
Author

I migrated to v2 and reran this test. It works except when the special case is followed by a single period. Of note: before adding the special case, the suffix '_.' remains attached to the token.

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! or _MATH_? or _MATH_: or _MATH_; or even _MATH_.. but not _MATH_. or _MATH_.'

print([w.text for w in nlp(text)])
# As expected it treats prefix and suffix symbols as tokens, but it fails to split when the suffix is '_.'
# ['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'or', '_', 'MATH', '_', '?', 'or', '_', 'MATH', '_', ':', 'or', '_', 'MATH', '_', ';', 'or', 'even', '_', 'MATH', '_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']

nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token is followed by a period
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'or', '_MATH_', '?', 'or', '_MATH_', ':', 'or', '_MATH_', ';', 'or', 'even', '_MATH_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']
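A possible workaround, assuming special cases match whole whitespace-delimited chunks exactly, is to also register the period-suffixed variant as its own rule. In spaCy that would be something like `nlp.tokenizer.add_special_case('_MATH_.', [{ORTH: '_MATH_'}, {ORTH: '.'}])` (untested here). In toy form:

```python
# Hypothetical workaround sketch (toy code, not spaCy's API): since special
# cases are looked up per whitespace-delimited chunk, cover the chunk fused
# with a trailing period by mapping it to two tokens.
special_cases = {
    '_MATH_':  ['_MATH_'],
    '_MATH_.': ['_MATH_', '.'],   # the period-suffixed variant, split in two
}

def tokenize_with_workaround(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(special_cases.get(chunk, [chunk]))
    return tokens

print(tokenize_with_workaround('but not _MATH_.'))
# ['but', 'not', '_MATH_', '.']
```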

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018