
Tokenizer does not properly serialize to disk #4190

Closed

mikerossgithub opened this issue Aug 23, 2019 · 3 comments
Labels: bug (Bugs and behaviour differing from documentation), feat / serialize (Feature: Serialization, saving and loading), feat / tokenizer (Feature: Tokenizer)

@mikerossgithub

How to reproduce the behaviour

I am using spaCy's default Tokenizer with a slightly modified set of exceptions (no exceptions for single letters followed by a period). The customized Language tokenizes as expected, but after saving to disk and reloading, the tokenizer is no longer customized:

Code Output:

Original Tokenizer:
[Test, c.]
Customized Tokenizer:
[Test, c, .]
Saved and reloaded Tokenizer, should be the same as customized:
[Test, c.]

Code to reproduce:

import spacy
from spacy.tokenizer import Tokenizer

def customize_tokenizer(nlp):
    prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)
    infix_re = spacy.util.compile_infix_regex(nlp.Defaults.infixes)

    # remove all exceptions where a single letter is followed by a period (e.g. 'h.')
    exceptions = {
        k: v
        for k, v in dict(nlp.Defaults.tokenizer_exceptions).items()
        if not (len(k) == 2 and k[1] == '.')
    }
    new_tokenizer = Tokenizer(nlp.vocab, exceptions,
                              prefix_search=prefix_re.search,
                              suffix_search=suffix_re.search,
                              infix_finditer=infix_re.finditer,
                              token_match=nlp.tokenizer.token_match)

    nlp.tokenizer = new_tokenizer

# Load default Language
nlp = spacy.load('en_core_web_sm')
print("Original Tokenizer:")
print(list(nlp("Test c.")))

# Modify Tokenizer
customize_tokenizer(nlp)
print("Customized Tokenizer:")
print(list(nlp("Test c.")))

# Save and Reload
nlp.to_disk('x')
nlp = spacy.load('x')
print("Saved and reloaded Tokenizer, should be the same as customized:")
print(list(nlp("Test c.")))
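
For now I am working around it by re-applying the customization to the freshly loaded pipeline. This stopgap is my own, not something the docs prescribe; a minimal sketch reusing the customize_tokenizer helper above:

# Workaround sketch: the custom exceptions are not restored from disk,
# so re-run the customization on the reloaded pipeline.
nlp = spacy.load('x')
customize_tokenizer(nlp)
print(list(nlp("Test c.")))  # prints [Test, c, .] again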

Your Environment

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Linux-4.15.0-58-generic-x86_64-with-Ubuntu-16.04-xenial
  • Python version: 3.6.8
@mikerossgithub (Author)

Note this is not the same as #2682, which used a different Tokenizer class.

@ines added the bug, feat / serialize and feat / tokenizer labels on Aug 26, 2019
@svlandeg (Member)

Thanks for the very helpful report! We were able to find and address the bug; see PR #4207.
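
Once a release includes the fix, the reporter's script should round-trip cleanly. A quick regression check (a sketch reusing the customize_tokenizer helper from the report; the expected tokens are taken from the output above):

import spacy

nlp = spacy.load('en_core_web_sm')
customize_tokenizer(nlp)  # helper from the original report
nlp.to_disk('x')

# With the fix, the reloaded tokenizer keeps the modified exceptions.
reloaded = spacy.load('x')
assert [t.text for t in reloaded("Test c.")] == ['Test', 'c', '.']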

@lock (bot) commented Sep 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock locked as resolved and limited conversation to collaborators on Sep 29, 2019