
Skip RoFormer ONNX test if rjieba not installed #16981

Merged: 14 commits into main on May 4, 2022
Conversation

lewtun (Member) commented Apr 28, 2022

What does this PR do?

This PR adds the @require_rjieba decorator to the slow ONNX tests to deal with the following error in our daily CI runs:

(line 164)  ImportError: You need to install rjieba to use RoFormerTokenizer. See https://pypi.org/project/rjieba/ for installation.

I wasn't sure whether rjieba should actually be installed in the GitHub workflow, but it doesn't seem to be installed for the RoFormer tests, so I omitted that for now.

Edit: I've added rjieba to the "tests" extras and also tested that the slow ONNX test passes when this dep is installed:

RUN_SLOW=1 pytest tests/onnx/test_onnx_v2.py -s -k "roformer"
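For reference, skip decorators of this kind in `transformers.testing_utils` generally follow a standard pattern. A minimal sketch (illustrative only, not the library's exact code):

```python
import importlib.util
import unittest


def require_rjieba(test_case):
    """Skip a test unless the rjieba package is importable.

    Sketch of the usual require_* skip-decorator pattern; the real
    decorator lives in transformers.testing_utils.
    """
    has_rjieba = importlib.util.find_spec("rjieba") is not None
    return unittest.skipUnless(has_rjieba, "test requires rjieba")(test_case)
```

Applied to a test method, this marks it as skipped (rather than failed) on machines where rjieba isn't installed, which is exactly what the daily CI needs.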

HuggingFaceDocBuilderDev commented Apr 28, 2022

The documentation is not available anymore as the PR was closed or merged.

patrickvonplaten (Contributor)

So this test is currently just skipped in our ONNX test runs?

Shouldn't we instead add rjieba to the test dependencies so RoFormer gets tested? (I couldn't find it in the setup.py or in the Dockerfile.) cc @LysandreJik @sgugger

sgugger (Collaborator) commented Apr 28, 2022

Yes, it looks like RoFormer tokenization is completely untested, so this package should be added to the "testing" extra.

ydshieh (Collaborator) commented Apr 28, 2022

I agree we should include rjieba for the tests, unless there is a reason not to add specific packages.

Further remark

If we can't add rjieba, it's not a good idea to add @require_rjieba to test_pytorch_export, since then the test wouldn't run for any other model either. In that case, I think we need a model-specific way to skip this test for RoFormer (and any others that require rjieba).
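One hedged way to do that (the mapping and helper name below are illustrative, not existing helpers in the repo): keep a table of model types that need optional tokenizer dependencies, and raise SkipTest inside the parametrized test body only for those models.

```python
import importlib.util
import unittest

# Illustrative mapping; only RoFormer currently needs an extra tokenizer dep.
TOKENIZER_EXTRA_DEPS = {"roformer": "rjieba"}


def skip_if_tokenizer_dep_missing(model_type):
    """Skip the current test when model_type needs an absent tokenizer dep."""
    dep = TOKENIZER_EXTRA_DEPS.get(model_type)
    if dep is not None and importlib.util.find_spec(dep) is None:
        raise unittest.SkipTest(f"{model_type} tokenizer requires {dep}")

# Inside test_pytorch_export, this would run before loading the tokenizer:
#     skip_if_tokenizer_dep_missing(model_type)
```

This keeps test_pytorch_export running for every other model even on machines where rjieba is missing.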

lewtun (Member, Author) commented Apr 28, 2022

Thanks for the feedback - I'll add rjieba to our testing suite as well :)

@@ -71,3 +71,11 @@ def test_training_new_tokenizer(self):
# can't train new_tokenizer via Tokenizers lib
def test_training_new_tokenizer_with_special_tokens_change(self):
pass

# can't serialise custom PreTokenizer
def test_save_slow_from_fast_and_reload_fast(self):
lewtun (Member, Author) commented Apr 28, 2022
This test was failing with Exception: Custom PreTokenizer cannot be serialized:

    def test_save_slow_from_fast_and_reload_fast(self):
        if not self.test_slow_tokenizer or not self.test_rust_tokenizer:
            # we need both slow and fast versions
            return
    
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
                with tempfile.TemporaryDirectory() as tmp_dir_1:
                    # Here we check that even if we have initialized a fast tokenizer with a tokenizer_file we can
                    # still save only the slow version and use these saved files to rebuild a tokenizer
                    tokenizer_fast_old_1 = self.rust_tokenizer_class.from_pretrained(
                        pretrained_name, **kwargs, use_fast=True
                    )
                    tokenizer_file = os.path.join(tmp_dir_1, "tokenizer.json")
>                   tokenizer_fast_old_1.backend_tokenizer.save(tokenizer_file)
E                   Exception: Custom PreTokenizer cannot be serialized

Looking at the similar issue huggingface/tokenizers#613, it seems that RoFormer belongs to the class of tokenizers that can't be saved with tokenizer.backend_tokenizer.save(). Here's an example to reproduce:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
tokenizer.backend_tokenizer.save(".")

I'm not very familiar with RoFormer, so happy to debug this further if we expect this test really should pass.
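If we wanted the shared test to degrade gracefully instead of erroring, one option (purely a sketch, not existing test-suite code) is to probe whether the backend is serializable and skip when it isn't:

```python
import os
import tempfile


def backend_is_serializable(backend_tokenizer):
    """Return True if backend_tokenizer.save() succeeds.

    Sketch: tokenizers raises a plain Exception ("Custom PreTokenizer
    cannot be serialized") for backends like RoFormer's, so the only
    portable probe is try/except around save().
    """
    with tempfile.TemporaryDirectory() as tmp:
        try:
            backend_tokenizer.save(os.path.join(tmp, "tokenizer.json"))
            return True
        except Exception:
            return False
```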

patrickvonplaten (Contributor) commented Apr 29, 2022

I think we can definitely skip this one, since this tokenizer cannot be saved in the "fast" format.

pass

# can't serialise custom PreTokenizer
def test_saving_tokenizer_trainer(self):
lewtun (Member, Author) commented Apr 28, 2022

This test fails for a different reason: as far as I can tell, saving the fast tokenizer and loading it again fails because vocab_file is None at init:

src/transformers/tokenization_utils_base.py:1783: in from_pretrained
    return cls._from_pretrained(
src/transformers/tokenization_utils_base.py:1809: in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
src/transformers/tokenization_utils_base.py:1918: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
src/transformers/models/roformer/tokenization_roformer.py:145: in __init__
    if not os.path.isfile(vocab_file):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path = None

    def isfile(path):
        """Test whether a path is a regular file"""
        try:
>           st = os.stat(path)
E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Here's an example to reproduce:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
tokenizer.save_pretrained("./tmp/roformer", legacy_format=False)
AutoTokenizer.from_pretrained("./tmp/roformer/")

patrickvonplaten (Contributor)

I think you discovered a bug here @lewtun!

I've done some digging, and I think the bug is that the vocab file mapping is not correctly defined for RoFormer.

Could you replace

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

by:

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}

This should make your code snippet above work, and hopefully also make the test pass.
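To see why the mapping matters: from_pretrained resolves each entry of VOCAB_FILES_NAMES to a path inside the saved directory, so a file type missing from the mapping is never looked up and its init argument stays None. A simplified stand-in for that resolution logic (not the library's actual code):

```python
import os


def resolve_vocab_files(save_dir, vocab_files_names):
    """Map each logical file key to its on-disk path, or None if absent."""
    resolved = {}
    for key, filename in vocab_files_names.items():
        path = os.path.join(save_dir, filename)
        resolved[key] = path if os.path.isfile(path) else None
    return resolved
```

With only {"vocab_file": "vocab.txt"}, a directory saved with legacy_format=False (which writes tokenizer.json but no vocab.txt) resolves everything to None, matching the os.stat(None) failure above; adding the "tokenizer_file" entry lets the fast tokenizer file be found.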

lewtun (Member, Author)

Thanks for the advice! Fixed in 1abc58a

With this fix, the following slow tests now all pass:

RUN_SLOW=1 pytest tests/roformer/test_tokenization_roformer.py -s

lewtun (Member, Author) commented Apr 28, 2022

Hey @sgugger @patrickvonplaten I'm hitting some peculiar issues with 2 of the slow tests of the RoFormer tokenizer. Would you mind taking a look and seeing if my decision to skip them is valid?

sgugger (Collaborator) commented Apr 28, 2022

I'll let @patrickvonplaten decide, as I know nothing about that model either :-)

patrickvonplaten (Contributor) left a review

I think we can fix one test by correcting the bug mentioned in my review comment. Happy to skip the other test and merge afterward :-)

LysandreJik (Member)

There is a test job dedicated to custom tokenizers with specific dependencies: https://github.com/huggingface/transformers/blob/main/.circleci/config.yml#L538

It installs jieba but not rjieba. Would it make sense to add it there? If you're testing ONNX export, it likely doesn't make sense, as that job is limited to tokenizer tests right now.

lewtun (Member, Author) commented May 3, 2022

> There is a test job dedicated to custom tokenizers with specific dependencies: https://github.com/huggingface/transformers/blob/main/.circleci/config.yml#L538
>
> It installs jieba but not rjieba. Would it make sense to add it there? If you're testing ONNX export, it likely doesn't make sense, as that job is limited to tokenizer tests right now.

Thanks for the tip! Done in 3cafcb2

@@ -549,7 +549,7 @@ jobs:
- v0.4-custom_tokenizers-{{ checksum "setup.py" }}
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy]
- run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]
lewtun (Member, Author)

This adds rjieba to the custom tokenizer tests on CircleCI

lewtun (Member, Author) commented May 3, 2022

Hey @patrickvonplaten @LysandreJik I think this PR is ready for a final pass :)

The failing test is unrelated to the PR itself (a failing Pegasus generate test).

sgugger (Collaborator) left a review

You'll need to rebase for the move of the test files. Otherwise LGTM!

@lewtun lewtun merged commit 4bb1d0e into main May 4, 2022
@lewtun lewtun deleted the fix-onnx-roformer branch May 4, 2022 08:04
nandwalritik pushed a commit to nandwalritik/transformers that referenced this pull request May 4, 2022
* Skip RoFormer ONNX test if rjieba not installed

* Update deps table

* Skip RoFormer serialization test

* Fix RoFormer vocab

* Add rjieba to CircleCI
Narsil pushed a commit to Narsil/transformers that referenced this pull request May 12, 2022 (same commit list as above)
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022 (same commit list as above)
6 participants