
Tokenizer cache doesn't handle modifications to special cases or token_match correctly #4238

Closed · adrianeboyd opened this issue Sep 4, 2019 · 4 comments · Fixed by #4258

Labels: bug (Bugs and behaviour differing from documentation), feat / tokenizer (Feature: Tokenizer)

adrianeboyd (Contributor) commented Sep 4, 2019

How to reproduce the behaviour

The GitHub-suggested related issues were actually helpful! #1061 seems to have snuck back in: modifying the tokenizer works in 2.0.18, but not in 2.1.0.

Modifications to special cases and token_match don't take effect if the pipeline has already been run at least once, because of the tokenizer cache.

import spacy
from spacy.symbols import ORTH

text = '(_SPECIAL_) A/B'

# Special cases added before the pipeline has run: both take effect.
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer.add_special_case('_SPECIAL_', [{ORTH: '_SPECIAL_'}])
nlp.tokenizer.add_special_case('A/B', [{ORTH: 'A/B'}])
print([token.text for token in nlp(text)])
# ['(', '_SPECIAL_', ')', 'A/B']

# Special cases added after the pipeline has already run once: the
# '_SPECIAL_' case is never applied.
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(text)])
# ['(', '_', 'SPECIAL', '_', ')', 'A', '/', 'B']
nlp.tokenizer.add_special_case('_SPECIAL_', [{ORTH: '_SPECIAL_'}])
nlp.tokenizer.add_special_case('A/B', [{ORTH: 'A/B'}])
print([token.text for token in nlp(text)])
# ['(', '_', 'SPECIAL', '_', ')', 'A/B']

text = "This is a URL: http://example.com/file.html."

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer.token_match = None
print([token.text for token in nlp(text)])
# ['This', 'is', 'a', 'URL', ':', 'http://example.com', '/', 'file.html', '.']

nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(text)])
# ['This', 'is', 'a', 'URL', ':', 'http://example.com/file.html', '.']
nlp.tokenizer.token_match = None
print([token.text for token in nlp(text)])
# ['This', 'is', 'a', 'URL', ':', 'http://example.com/file.html', '.']
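
As a stopgap in affected versions, it should be possible to avoid the stale cache by building a fresh Tokenizer with the desired rules before any text is processed, rather than mutating the existing one afterwards. This is only a sketch, assuming the v2.x Tokenizer constructor signature; it hasn't been verified against the outputs above.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')
# Copy the default special cases and add the new ones up front, so no
# substring ends up cached under the old rules.
rules = dict(nlp.Defaults.tokenizer_exceptions)
rules['_SPECIAL_'] = [{ORTH: '_SPECIAL_'}]
rules['A/B'] = [{ORTH: 'A/B'}]
nlp.tokenizer = Tokenizer(nlp.vocab, rules=rules,
                          prefix_search=nlp.tokenizer.prefix_search,
                          suffix_search=nlp.tokenizer.suffix_search,
                          infix_finditer=nlp.tokenizer.infix_finditer,
                          token_match=None)  # drop URL matching, or keep the original
print([token.text for token in nlp('(_SPECIAL_) A/B')])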

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-4.19.0-5-amd64-x86_64-with-debian-10.0
  • Python version: 3.7.3
adrianeboyd changed the title from "Tokenizer special cases behave inconsistently depending on when they are added" to "Tokenizer special cases behave inconsistently depending on pipeline state" on Sep 4, 2019
svlandeg added the labels bug (Bugs and behaviour differing from documentation) and feat / tokenizer (Feature: Tokenizer) on Sep 4, 2019
adrianeboyd (Contributor, Author) commented:

This caching problem has been making me think I was losing my mind while testing special cases and token_match with the tokenizer.

Here's the commit that went missing from v1->v2 that deals with the cache problem:

4b2e5e5

I think that a solution like this could fix the problem, but I'm not sure it's 100% correct for v2.

When I test this with 2.0.18 it seems to work, but I'm not sure why given the minimal differences in the tokenizer between 2.0.18 and 2.1.8.
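
For illustration, here is a toy model of the interaction (not spaCy's actual implementation): a tokenizer that memoizes per-substring results will keep returning the old split for any substring it saw before a rule was changed, unless the cache is flushed.

# Toy illustration of the caching problem; not spaCy's implementation.
class ToyTokenizer:
    def __init__(self):
        self.special_cases = {}
        self._cache = {}  # substring -> previously computed token list

    def add_special_case(self, string, tokens):
        self.special_cases[string] = tokens
        # Without this flush, substrings tokenized before the rule was
        # added keep returning their stale cached split.
        self._cache.clear()

    def __call__(self, text):
        tokens = []
        for substring in text.split(' '):
            if substring not in self._cache:
                self._cache[substring] = self.special_cases.get(
                    substring, list(substring))
            tokens.extend(self._cache[substring])
        return tokens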

adrianeboyd changed the title from "Tokenizer special cases behave inconsistently depending on pipeline state" to "Tokenizer cache doesn't handle modifications to special cases or token_match correctly" on Sep 6, 2019
honnibal (Member) commented Sep 8, 2019

Wow, no idea how that patch went missing! Glad I wrote some notes on that...

So can we just take that commit?

adrianeboyd (Contributor, Author) commented:

No, it doesn't quite work, either. I have a new version coming...

adrianeboyd added a commit to adrianeboyd/spaCy that referenced this issue on Sep 8, 2019:
Flush tokenizer cache when affixes, token_match, or special cases are modified.
Fixes explosion#4238, same issue as in explosion#1250.
honnibal pushed a commit that referenced this issue on Sep 8, 2019:
Flush tokenizer cache when affixes, token_match, or special cases are modified.
Fixes #4238, same issue as in #1250.
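
With the cache flushed whenever the affixes, token_match, or special cases change, the second half of the original reproduction should behave like the first. A sketch of the expected behaviour after the fix (the output comment is the expected result, not taken from a released build):

import spacy
from spacy.symbols import ORTH

text = '(_SPECIAL_) A/B'
nlp = spacy.load('en_core_web_sm')
print([token.text for token in nlp(text)])  # run once to populate the cache
# Adding the special cases now flushes the cache, so they take effect.
nlp.tokenizer.add_special_case('_SPECIAL_', [{ORTH: '_SPECIAL_'}])
nlp.tokenizer.add_special_case('A/B', [{ORTH: 'A/B'}])
print([token.text for token in nlp(text)])
# expected: ['(', '_SPECIAL_', ')', 'A/B']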