tokenizer.add_special_case not working when special token not followed by whitespace #1061
Comments
This bug reproduces only when the pipeline has already been used:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
print([w.text for w in nlp(text)])
#['...', 'gimme', '...', '?', 'that', '...', 'gimme', '...', '?', 'or', 'else', '...', 'gimme', '...', '?', '!']
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
#['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...', 'gimme', '...', '?', '!']
But when the special case is added before the pipeline is first used:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md')
text = '...gimme...? that ...gimme...? or else ...gimme...?!'
nlp.tokenizer.add_special_case(u'...gimme...?', [{ORTH: u'...gimme...?'}])
print([w.text for w in nlp(text)])
#['...gimme...?', 'that', '...gimme...?', 'or', 'else', '...gimme...?', '!']
Info about spaCy
Both really helpful, thanks!
It seems to be even a little more nuanced. Comments inline with test case:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
text = 'I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_.'
print([w.text for w in nlp(text)])
# As expected it treats prefix and suffix symbols as tokens
#['I', 'like', '_', 'MATH', '_', 'even', '_', 'MATH', '_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_', 'MATH', '_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# The special case gives the desired tokenization except when the token isn't followed by whitespace
#['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_', 'MATH', '_', ',', 'except', 'when', '_MATH_', 'is', '_', 'MATH', '_', '!', 'but', 'not', '_', 'MATH_.']
# Reset pipeline
nlp = spacy.load('en_depent_web_md', parser=False, entity=False)
nlp.tokenizer.add_special_case('_MATH_', [{ORTH: '_MATH_'}])
print([w.text for w in nlp(text)])
# As SlavaGanzin points out, adding the special case before using the pipeline gives the expected behavior, except when the token is followed (or preceded) by a period
# ['I', 'like', '_MATH_', 'even', '_MATH_', 'when', '_MATH_', ',', 'except', 'when', '_MATH_', 'is', '_MATH_', '!', 'but', 'not', '_', 'MATH_.']
Thanks for the great report. This was a very long-standing cache invalidation bug. Comment from the patch:
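For readers unfamiliar with this class of bug, here is a toy sketch of how a stale tokenizer cache can produce the pattern reported above. It is not spaCy's code and not the patch: `ToyTokenizer`, its `flush_cache_on_add` flag, and the shortened example text are all hypothetical, chosen only to show why a rule added after the first call can be ignored for chunks that were already cached.

```python
class ToyTokenizer:
    """Toy whitespace tokenizer with special cases and a per-chunk cache.

    Hypothetical illustration only; this is not spaCy's implementation.
    """

    PUNCT = "!?.,"

    def __init__(self, flush_cache_on_add):
        self.special_cases = {}  # exact text -> list of token strings
        self.cache = {}          # whitespace-delimited chunk -> token strings
        self.flush_cache_on_add = flush_cache_on_add

    def add_special_case(self, text, tokens):
        self.special_cases[text] = list(tokens)
        if self.flush_cache_on_add:
            # Cached results may predate the new rule, so drop all of them.
            self.cache.clear()
        else:
            # Stale variant: only the exact entry is refreshed, so longer
            # chunks that contain `text` keep their old tokenization.
            self.cache[text] = list(tokens)

    def _split_chunk(self, chunk):
        # Peel trailing punctuation one character at a time, stopping as
        # soon as the remainder matches a special case.
        suffixes = []
        core = chunk
        while core and core not in self.special_cases and core[-1] in self.PUNCT:
            suffixes.insert(0, core[-1])
            core = core[:-1]
        if core in self.special_cases:
            head = list(self.special_cases[core])
        else:
            head = [core] if core else []
        return head + suffixes

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self.cache:
                self.cache[chunk] = self._split_chunk(chunk)
            tokens.extend(self.cache[chunk])
        return tokens


text = "gimme? or gimme?!"

stale = ToyTokenizer(flush_cache_on_add=False)
stale(text)  # first call warms the cache, including the chunk "gimme?!"
stale.add_special_case("gimme?", ["gimme?"])
print(stale(text))  # ['gimme?', 'or', 'gimme', '?', '!']

fixed = ToyTokenizer(flush_cache_on_add=True)
fixed(text)
fixed.add_special_case("gimme?", ["gimme?"])
print(fixed(text))  # ['gimme?', 'or', 'gimme?', '!']
```

In the stale variant the whitespace-delimited `gimme?` picks up the new rule, but `gimme?!` was cached on the first call and is never recomputed, which mirrors the symptom in the reports. Adding the special case before the first call avoids the problem in both variants because nothing has been cached yet.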
While migrating to v2 I reran this test. It works except when the special case is followed by a single period. Of note, before the special case is added, the suffix '_.' remains attached to the token.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
When a special-case token is followed by a non-whitespace character, the special case isn't recognized. I wrote a quick test to demonstrate:
The last '...gimme...?' is broken up into '...', 'gimme', '...', '?' by the presence of the '!'.
Your Environment