
tokenizer.add_special_case not working when special token not followed by whitespace #1061

Closed
christian-storm opened this issue May 16, 2017 · 6 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@christian-storm

When a special-case token is followed by a non-whitespace character, the special case isn't recognized. I wrote a quick test to demonstrate:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md')  # nlp was not defined in the original snippet
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
#print([w.text for w in nlp(text)])
assert [w.text for w in nlp(text)] == ['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']

The last '...gimme...?' is broken up into '...', 'gimme', '...', '?' by the presence of the '!'.
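For intuition, this failure mode is consistent with an exact-match lookup over whitespace-delimited chunks. A simplified sketch (toy code, not spaCy's actual implementation):

```python
# Hypothetical sketch of why the special case is missed: the tokenizer first
# splits the text on whitespace, then checks each chunk against the
# special-case table by exact match. The chunk '...gimme...?!' is not an
# exact key, so the rule never fires and the chunk falls through to the
# normal prefix/suffix/infix rules instead.
special_cases = {'...gimme...?': ['...gimme...?']}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in special_cases:           # exact-match lookup only
            tokens.extend(special_cases[chunk])
        else:
            tokens.append(chunk)             # the real tokenizer applies affix rules here
    return tokens

print(toy_tokenize('...gimme...? that ...gimme...?!'))
# ['...gimme...?', 'that', '...gimme...?!']  — the last chunk misses the rule
```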

Your Environment

  • spaCy version: 1.8.2
  • Platform: Darwin-16.5.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en_depent_web_md
@slavaGanzin

slavaGanzin commented May 16, 2017

This bug only reproduces when you have already run the pipeline:

import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
print([w.text for w in nlp(text)])
#['...', 'gimme', '...', '?', 'that', '...', 'gimme', '...', '?', 'or', 'else', '...', 'gimme', '...', '?', '!']
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
#['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...', 'gimme', '...', '?', '!']

But:

import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
#['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']

Info about spaCy

  • spaCy version: 1.8.2
  • Platform: Linux-4.10.13-1-ARCH-x86_64-with-arch
  • Python version: 3.6.1
  • Installed models: en, en_depent_web_md

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label May 16, 2017
@honnibal
Member

Both really helpful, thanks!

@christian-storm
Author

It seems to be even a little more nuanced. Comments are inline with the test case:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_.'
print([w.text for w in nlp(text)])
# As expected it treats prefix and suffix symbols as tokens
#['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']

nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token isn't followed by whitespace
#['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_MATH_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']

# Reset pipeline
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# As slavaGanzin points out, adding the special case before running the pipeline gives the expected behavior, except when the token is followed (or preceded) by a period
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'but', 'not', '_', 'MATH_.']

@honnibal
Member

Many thanks for the great report --- this was a very long-standing cache invalidation bug. Comment from the patch:

Add flush_cache method to tokenizer, to fix #1061

The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I think this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
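The failure mode described in the patch comment can be sketched with a toy caching tokenizer. This is illustrative only; the `flush_cache` name mirrors the patch, but the code is not spaCy's:

```python
# Toy illustration of the stale-cache bug: chunk tokenizations cached before
# a rule change keep being returned unless the cache is flushed when the
# rules are modified. The fix is to invalidate the cache in
# add_special_case, as the patch does with flush_cache.
class CachingTokenizer:
    def __init__(self):
        self.special_cases = {}
        self._cache = {}

    def add_special_case(self, chunk, tokens):
        self.special_cases[chunk] = tokens
        self.flush_cache()                   # the fix: invalidate on rule change

    def flush_cache(self):
        self._cache.clear()

    def tokenize(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:     # cache miss: compute and store
                self._cache[chunk] = self.special_cases.get(chunk, [chunk])
            tokens.extend(self._cache[chunk])
        return tokens
```

Without the `flush_cache()` call in `add_special_case`, a chunk tokenized before the rule was added would keep its stale cached result, which is exactly the order-dependent behavior slavaGanzin observed.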

@christian-storm
Author

I migrated to v2 and reran this test. It works except when the special case is followed by a single period. Of note: before adding the special case, the suffix '_.' remains attached to the token.

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! or _MATH_? or _MATH_: or _MATH_; or even _MATH_.. but not _MATH_. or _MATH_.'

print([w.text for w in nlp(text)])
# As expected it treats prefix and suffix symbols as tokens, but it fails to split when the suffix is '_.'
# ['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'or', '_', 'MATH', '_', '?', 'or', '_', 'MATH', '_', ':', 'or', '_', 'MATH', '_', ';', 'or', 'even', '_', 'MATH', '_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']

nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization, except when the token is followed by a period
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'or', '_MATH_', '?', 'or', '_MATH_', ':', 'or', '_MATH_', ';', 'or', 'even', '_MATH_', '..', 'but', 'not', '_', 'MATH_.', 'or', '_', 'MATH_.']
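A possible workaround, assuming special cases match whole whitespace-delimited chunks exactly, is to also register the period-suffixed variant as its own rule. In spaCy that would be something like `nlp.tokenizer.add_special_case('_MATH_.', [{ORTH: '_MATH_'}, {ORTH: '.'}])` (untested here). In toy form:

```python
# Hypothetical workaround sketch (toy code, not spaCy's API): since special
# cases are looked up per whitespace-delimited chunk, cover the chunk fused
# with a trailing period by mapping it to two tokens.
special_cases = {
    '_MATH_':  ['_MATH_'],
    '_MATH_.': ['_MATH_', '.'],   # the period-suffixed variant, split in two
}

def tokenize_with_workaround(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(special_cases.get(chunk, [chunk]))
    return tokens

print(tokenize_with_workaround('but not _MATH_.'))
# ['but', 'not', '_MATH_', '.']
```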

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018