
Tokenizer cache doesn't handle modifications to special cases or token_match correctly #4238

Closed · adrianeboyd opened this issue Sep 4, 2019 · 4 comments · Fixed by #4258

Labels: bug (Bugs and behaviour differing from documentation), feat / tokenizer (Feature: Tokenizer)

adrianeboyd (Contributor) commented Sep 4, 2019

How to reproduce the behaviour

The GitHub-suggested related issues were actually helpful! #1061 seems to have snuck back in: modifying the tokenizer works in 2.0.18, but not in 2.1.0.

Modifications to special cases and token_match don't take effect if the pipeline has already been run at least once, because of the tokenizer cache.

import spacy
from spacy.symbols import ORTH

text = '(_SPECIAL_) A/B'

# Special cases added before the pipeline has run: both take effect.
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer.add_special_case('_SPECIAL_', [{ORTH: '_SPECIAL_'}])
nlp.tokenizer.add_special_case('A/B', [{ORTH: 'A/B'}])
print([token.text for token in nlp(text)])
# ['(', '_SPECIAL_', ')', 'A/B']

# Special cases added after the pipeline has already run once: the
# '_SPECIAL_' case is never applied.
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(text)])
# ['(', '_', 'SPECIAL', '_', ')', 'A', '/', 'B']
nlp.tokenizer.add_special_case('_SPECIAL_', [{ORTH: '_SPECIAL_'}])
nlp.tokenizer.add_special_case('A/B', [{ORTH: 'A/B'}])
print([token.text for token in nlp(text)])
# ['(', '_', 'SPECIAL', '_', ')', 'A/B']

text = "This is a URL: http://example.com/file.html."

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer.token_match = None
print([token.text for token in nlp(text)])
# ['This', 'is', 'a', 'URL', ':', 'http://example.com', '/', 'file.html', '.']

nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(text)])
# ['This', 'is', 'a', 'URL', ':', 'http://example.com/file.html', '.']
nlp.tokenizer.token_match = None
print([token.text for token in nlp(text)])
# ['This', 'is', 'a', 'URL', ':', 'http://example.com/file.html', '.']
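
As a stopgap in affected versions, it should be possible to avoid the stale cache by building a fresh Tokenizer with the desired rules before any text is processed, rather than mutating the existing one afterwards. This is only a sketch, assuming the v2.x Tokenizer constructor signature; it hasn't been verified against the outputs above.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')
# Copy the default special cases and add the new ones up front, so no
# substring ends up cached under the old rules.
rules = dict(nlp.Defaults.tokenizer_exceptions)
rules['_SPECIAL_'] = [{ORTH: '_SPECIAL_'}]
rules['A/B'] = [{ORTH: 'A/B'}]
nlp.tokenizer = Tokenizer(nlp.vocab, rules=rules,
                          prefix_search=nlp.tokenizer.prefix_search,
                          suffix_search=nlp.tokenizer.suffix_search,
                          infix_finditer=nlp.tokenizer.infix_finditer,
                          token_match=None)  # drop URL matching, or keep the original
print([token.text for token in nlp('(_SPECIAL_) A/B')])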

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-4.19.0-5-amd64-x86_64-with-debian-10.0
  • Python version: 3.7.3
adrianeboyd changed the title from "Tokenizer special cases behave inconsistently depending on when they are added" to "Tokenizer special cases behave inconsistently depending on pipeline state" on Sep 4, 2019
svlandeg added the labels bug (Bugs and behaviour differing from documentation) and feat / tokenizer (Feature: Tokenizer) on Sep 4, 2019
adrianeboyd (Contributor, Author) commented:

This caching problem has been making me think I was losing my mind while testing special cases and token_match with the tokenizer.

Here's the commit that went missing from v1->v2 that deals with the cache problem:

4b2e5e5

I think that a solution like this could fix the problem, but I'm not sure it's 100% correct for v2.

When I test this with 2.0.18 it seems to work, but I'm not sure why given the minimal differences in the tokenizer between 2.0.18 and 2.1.8.
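
For illustration, here is a toy model of the interaction (not spaCy's actual implementation): a tokenizer that memoizes per-substring results will keep returning the old split for any substring it saw before a rule was changed, unless the cache is flushed.

# Toy illustration of the caching problem; not spaCy's implementation.
class ToyTokenizer:
    def __init__(self):
        self.special_cases = {}
        self._cache = {}  # substring -> previously computed token list

    def add_special_case(self, string, tokens):
        self.special_cases[string] = tokens
        # Without this flush, substrings tokenized before the rule was
        # added keep returning their stale cached split.
        self._cache.clear()

    def __call__(self, text):
        tokens = []
        for substring in text.split(' '):
            if substring not in self._cache:
                self._cache[substring] = self.special_cases.get(
                    substring, list(substring))
            tokens.extend(self._cache[substring])
        return tokens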

adrianeboyd changed the title from "Tokenizer special cases behave inconsistently depending on pipeline state" to "Tokenizer cache doesn't handle modifications to special cases or token_match correctly" on Sep 6, 2019
honnibal (Member) commented Sep 8, 2019

Wow, no idea how that patch went missing! Glad I wrote some notes on that...

So can we just take that commit?

adrianeboyd (Contributor, Author) commented:

No, it doesn't quite work, either. I have a new version coming...

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this issue on Sep 8, 2019:
Flush tokenizer cache when affixes, token_match, or special cases are modified.
Fixes explosion#4238, same issue as in explosion#1250.
honnibal pushed a commit that referenced this issue on Sep 8, 2019:
Flush tokenizer cache when affixes, token_match, or special cases are modified.
Fixes #4238, same issue as in #1250.
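
With the cache flushed whenever the affixes, token_match, or special cases change, the second half of the original reproduction should behave like the first. A sketch of the expected behaviour after the fix (the output comment is the expected result, not taken from a released build):

import spacy
from spacy.symbols import ORTH

text = '(_SPECIAL_) A/B'
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(text)])  # run once to populate the cache
# Adding the special cases now flushes the cache, so they take effect.
nlp.tokenizer.add_special_case('_SPECIAL_', [{ORTH: '_SPECIAL_'}])
nlp.tokenizer.add_special_case('A/B', [{ORTH: 'A/B'}])
print([token.text for token in nlp(text)])
# expected: ['(', '_SPECIAL_', ')', 'A/B']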