Skip RoFormer ONNX test if rjieba not installed #16981
Conversation
The documentation is not available anymore as the PR was closed or merged.

Force-pushed from 94578a0 to 5ab3067.
So this test is then currently just skipped in our ONNX tests? Should we maybe rather add `rjieba`?
It looks like RoFormer tokenization is completely untested, yes, so this package should be added to the test dependencies.
I agree we should include `rjieba` in the test dependencies. Further remark: if we could not add it there, …
Thanks for the feedback - I'll add `rjieba` to the test dependencies.
```diff
@@ -71,3 +71,11 @@ def test_training_new_tokenizer(self):
     # can't train new_tokenizer via Tokenizers lib
     def test_training_new_tokenizer_with_special_tokens_change(self):
         pass
+
+    # can't serialise custom PreTokenizer
+    def test_save_slow_from_fast_and_reload_fast(self):
```
This test was failing with `Exception: Custom PreTokenizer cannot be serialized`:
```
def test_save_slow_from_fast_and_reload_fast(self):
    if not self.test_slow_tokenizer or not self.test_rust_tokenizer:
        # we need both slow and fast versions
        return

    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            with tempfile.TemporaryDirectory() as tmp_dir_1:
                # Here we check that even if we have initialized a fast tokenizer with a tokenizer_file we can
                # still save only the slow version and use these saved files to rebuild a tokenizer
                tokenizer_fast_old_1 = self.rust_tokenizer_class.from_pretrained(
                    pretrained_name, **kwargs, use_fast=True
                )
                tokenizer_file = os.path.join(tmp_dir_1, "tokenizer.json")
>               tokenizer_fast_old_1.backend_tokenizer.save(tokenizer_file)
E               Exception: Custom PreTokenizer cannot be serialized
```
Looking at the similar issue huggingface/tokenizers#613, it seems that RoFormer belongs to the class of tokenizers that can't be saved with `tokenizer.backend_tokenizer.save()`. Here's an example to reproduce:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
tokenizer.backend_tokenizer.save(".")
```
I'm not very familiar with RoFormer, so happy to debug this further if we expect this test really should pass.
This one I think we can definitely skip, since this tokenizer cannot be saved in the "fast" format.
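For comparison, one way to make such skips explicit is `unittest.skip` rather than overriding the inherited test with a bare `pass`. This is a hedged sketch of that pattern, not necessarily what the transformers test suite does; the class name is illustrative:

```python
import unittest

class RoFormerTokenizationSkipSketch(unittest.TestCase):
    # Overriding an inherited mixin test with a bare `pass` silently
    # swallows it; an explicit skip keeps the reason visible in reports.
    @unittest.skip("Custom PreTokenizer cannot be serialized")
    def test_save_slow_from_fast_and_reload_fast(self):
        pass
```

Running this class reports one skipped test along with the recorded reason.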
```diff
+        pass
+
+    # can't serialise custom PreTokenizer
+    def test_saving_tokenizer_trainer(self):
```
This test fails for a different reason: as far as I can tell, saving the fast tokenizer and loading it again fails because `vocab_file` is `None` in the init:
```
src/transformers/tokenization_utils_base.py:1783: in from_pretrained
    return cls._from_pretrained(
src/transformers/tokenization_utils_base.py:1809: in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
src/transformers/tokenization_utils_base.py:1918: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
src/transformers/models/roformer/tokenization_roformer.py:145: in __init__
    if not os.path.isfile(vocab_file):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

path = None

    def isfile(path):
        """Test whether a path is a regular file"""
        try:
>           st = os.stat(path)
E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
```
Here's an example to reproduce:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
tokenizer.save_pretrained("./tmp/roformer", legacy_format=False)
AutoTokenizer.from_pretrained("./tmp/roformer/")
```
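The bottom of that traceback can be reproduced in isolation: `os.path.isfile` only swallows `OSError`/`ValueError` internally, so an unresolved `vocab_file` of `None` raises `TypeError` instead of returning `False`. A minimal standalone check (no transformers install needed):

```python
import os

# os.stat(None) raises TypeError, which os.path.isfile does not catch,
# mirroring the failure in tokenization_roformer.py's __init__.
try:
    os.path.isfile(None)
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

print(outcome)  # TypeError
```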
I think you discovered a bug here @lewtun! I've done some digging, and I believe the bug is that the vocab files are not correctly defined for RoFormer.
Could you replace

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
```

by:

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}
```
This should make your code snippet above work and hopefully also make the test pass.
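To see why the extra `tokenizer_file` entry matters, here is a hedged sketch of the file-resolution behaviour (`resolve_init_files` is an illustrative helper, not the actual transformers code): loading only looks up filenames declared in `VOCAB_FILES_NAMES`, so without a `tokenizer_file` key a saved `tokenizer.json` is never found and `vocab_file` stays `None`.

```python
import os

def resolve_init_files(save_dir, vocab_files_names):
    # Illustrative sketch: map each declared file key to its on-disk path,
    # or None when the file is absent. Keys missing from the mapping are
    # simply never looked up at all.
    resolved = {}
    for key, filename in vocab_files_names.items():
        path = os.path.join(save_dir, filename)
        resolved[key] = path if os.path.isfile(path) else None
    return resolved
```

With only `{"vocab_file": "vocab.txt"}` declared, a directory saved with `legacy_format=False` (which writes `tokenizer.json` but no `vocab.txt`) resolves to `{"vocab_file": None}`, which is exactly the `None` that later reaches `os.path.isfile` in `__init__`.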
Thanks for the advice! Fixed in 1abc58a.

With this fix, the following slow tests now all pass:

```shell
RUN_SLOW=1 pytest tests/roformer/test_tokenization_roformer.py -s
```
Hey @sgugger @patrickvonplaten, I'm hitting some peculiar issues with 2 of the slow tests of the RoFormer tokenizer. Would you mind taking a look and seeing whether my decision to skip them is valid?
I'll let @patrickvonplaten decide, as I know nothing about that model either :-)
Think we can fix 1 test by correcting a bug as mentioned in a comment. Happy to skip the other test and merge afterward :-)
There is a test job dedicated to custom tokenizers with specific dependencies: https://github.com/huggingface/transformers/blob/main/.circleci/config.yml#L538. It installs the extra dependencies these tokenizers need.
Thanks for the tip! Done in 3cafcb2.
```diff
@@ -549,7 +549,7 @@ jobs:
       - v0.4-custom_tokenizers-{{ checksum "setup.py" }}
       - v0.4-{{ checksum "setup.py" }}
     - run: pip install --upgrade pip
-    - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy]
+    - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]
```
This adds `rjieba` to the custom tokenizer tests on CircleCI.
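For context, optional dependencies like this usually flow through setuptools extras. Here is a hedged sketch of how a `rjieba` extra can feed a `testing` extra (names and dependency lists are illustrative, not transformers' actual `setup.py`):

```python
# Sketch of setuptools-style extras wiring (illustrative only): declaring
# rjieba as its own extra and folding it into the testing extra is what
# lets `pip install .[testing]` (or the CircleCI pip install line above,
# which names rjieba directly) pull the package in.
extras = {}
extras["rjieba"] = ["rjieba"]
extras["testing"] = ["pytest", "pytest-xdist"] + extras["rjieba"]
```

In a real `setup.py`, the `extras` dict would be passed as `extras_require=extras` to `setuptools.setup(...)`.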
Hey @patrickvonplaten @LysandreJik, I think this PR is ready for a final pass :) The failing test is unrelated to the PR itself (a failing Pegasus generate test).
You'll need to rebase for the move of the test files. Otherwise LGTM!
* Skip RoFormer ONNX test if rjieba not installed
* Update deps table
* Skip RoFormer serialization test
* Fix RoFormer vocab
* Add rjieba to CircleCI
What does this PR do?

This PR adds the `@require_rjieba` decorator to the slow ONNX tests to deal with a missing-dependency error in our daily CI runs. I wasn't sure if `rjieba` should actually be installed in the GitHub workflow, but it doesn't seem to be the case for the RoFormer tests, so I omitted that for now.

Edit: I've added `rjieba` to the `"tests"` extras and also tested that the slow ONNX test passes when this dep is installed.
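`@require_rjieba` follows the require-dependency pattern common in transformers' test utilities. A hedged, self-contained sketch of that pattern (the real implementation lives in `testing_utils` and may differ; the test function below is hypothetical):

```python
import importlib.util
import unittest

def require_rjieba(test_case):
    # Skip the decorated test (or whole TestCase) unless rjieba is importable.
    if importlib.util.find_spec("rjieba") is None:
        return unittest.skip("test requires rjieba")(test_case)
    return test_case

@require_rjieba
def test_roformer_onnx_export():
    ...  # would exercise the ONNX export only when rjieba is available
```

When `rjieba` is absent, `unittest.skip` marks the test so runners report it as skipped instead of erroring at import time.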