Skip RoFormer ONNX test if rjieba not installed #16981
Conversation
The documentation is not available anymore as the PR was closed or merged.

Force-pushed from 94578a0 to 5ab3067.
So this test is then currently just skipped in our ONNX tests? Should we maybe rather add `rjieba`?
It looks like RoFormer tokenization is completely untested, yes, so this package should be added to the test dependencies.
I agree we should include `rjieba` in the test dependencies. Further remark: if we could not add it there, …
Thanks for the feedback - I'll add `rjieba` to the test dependencies.
```diff
@@ -71,3 +71,11 @@ def test_training_new_tokenizer(self):
     # can't train new_tokenizer via Tokenizers lib
     def test_training_new_tokenizer_with_special_tokens_change(self):
         pass
+
+    # can't serialise custom PreTokenizer
+    def test_save_slow_from_fast_and_reload_fast(self):
```
This test was failing with `Exception: Custom PreTokenizer cannot be serialized`:
```
def test_save_slow_from_fast_and_reload_fast(self):
    if not self.test_slow_tokenizer or not self.test_rust_tokenizer:
        # we need both slow and fast versions
        return

    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            with tempfile.TemporaryDirectory() as tmp_dir_1:
                # Here we check that even if we have initialized a fast tokenizer with a tokenizer_file we can
                # still save only the slow version and use these saved files to rebuild a tokenizer
                tokenizer_fast_old_1 = self.rust_tokenizer_class.from_pretrained(
                    pretrained_name, **kwargs, use_fast=True
                )
                tokenizer_file = os.path.join(tmp_dir_1, "tokenizer.json")
>               tokenizer_fast_old_1.backend_tokenizer.save(tokenizer_file)
E               Exception: Custom PreTokenizer cannot be serialized
```
Looking at the similar issue huggingface/tokenizers#613, it seems that RoFormer belongs to the class of tokenizers that can't be saved with `tokenizer.backend_tokenizer.save()`. Here's an example to reproduce:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
tokenizer.backend_tokenizer.save(".")
```
I'm not very familiar with RoFormer, so happy to debug this further if we expect this test really should pass.
This one I think we can definitely skip, since this tokenizer cannot be saved in the "fast" format.
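For comparison, one way to make such skips explicit is `unittest.skip` rather than overriding the inherited test with a bare `pass`. This is a hedged sketch of that pattern, not necessarily what the transformers test suite does; the class name is illustrative:

```python
import unittest

class RoFormerTokenizationSkipSketch(unittest.TestCase):
    # Overriding an inherited mixin test with a bare `pass` silently
    # swallows it; an explicit skip keeps the reason visible in reports.
    @unittest.skip("Custom PreTokenizer cannot be serialized")
    def test_save_slow_from_fast_and_reload_fast(self):
        pass
```

Running this class reports one skipped test along with the recorded reason.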
```diff
+        pass
+
+    # can't serialise custom PreTokenizer
+    def test_saving_tokenizer_trainer(self):
```
This test fails for a different reason: as far as I can tell, saving the fast tokenizer and loading it again fails because `vocab_file` is `None` in the init:
```
src/transformers/tokenization_utils_base.py:1783: in from_pretrained
    return cls._from_pretrained(
src/transformers/tokenization_utils_base.py:1809: in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
src/transformers/tokenization_utils_base.py:1918: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
src/transformers/models/roformer/tokenization_roformer.py:145: in __init__
    if not os.path.isfile(vocab_file):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

path = None

    def isfile(path):
        """Test whether a path is a regular file"""
        try:
>           st = os.stat(path)
E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
```
Here's an example to reproduce:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
tokenizer.save_pretrained("./tmp/roformer", legacy_format=False)
AutoTokenizer.from_pretrained("./tmp/roformer/")
```
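The bottom of that traceback can be reproduced in isolation: `os.path.isfile` only swallows `OSError`/`ValueError` internally, so an unresolved `vocab_file` of `None` raises `TypeError` instead of returning `False`. A minimal standalone check (no transformers install needed):

```python
import os

# os.stat(None) raises TypeError, which os.path.isfile does not catch,
# mirroring the failure in tokenization_roformer.py's __init__.
try:
    os.path.isfile(None)
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

print(outcome)  # TypeError
```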
I think you discovered a bug here @lewtun! I've done some digging, and I believe the bug is that the vocab files are not correctly defined for RoFormer.
Could you replace

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
```

by:

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}
```
This should make your code snippet above work and hopefully also make the test pass.
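To see why the extra `tokenizer_file` entry matters, here is a hedged sketch of the file-resolution behaviour (`resolve_init_files` is an illustrative helper, not the actual transformers code): loading only looks up filenames declared in `VOCAB_FILES_NAMES`, so without a `tokenizer_file` key a saved `tokenizer.json` is never found and `vocab_file` stays `None`.

```python
import os

def resolve_init_files(save_dir, vocab_files_names):
    # Illustrative sketch: map each declared file key to its on-disk path,
    # or None when the file is absent. Keys missing from the mapping are
    # simply never looked up at all.
    resolved = {}
    for key, filename in vocab_files_names.items():
        path = os.path.join(save_dir, filename)
        resolved[key] = path if os.path.isfile(path) else None
    return resolved
```

With only `{"vocab_file": "vocab.txt"}` declared, a directory saved with `legacy_format=False` (which writes `tokenizer.json` but no `vocab.txt`) resolves to `{"vocab_file": None}`, which is exactly the `None` that later reaches `os.path.isfile` in `__init__`.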
Thanks for the advice! Fixed in 1abc58a.

With this fix, the following slow tests now all pass:

```shell
RUN_SLOW=1 pytest tests/roformer/test_tokenization_roformer.py -s
```
Hey @sgugger @patrickvonplaten, I'm hitting some peculiar issues with 2 of the slow tests of the RoFormer tokenizer. Would you mind taking a look and seeing whether my decision to skip them is valid?
I'll let @patrickvonplaten decide, as I know nothing about that model either :-)
Think we can fix 1 test by correcting a bug as mentioned in a comment. Happy to skip the other test and merge afterward :-)
There is a test job dedicated to custom tokenizers with specific dependencies: https://github.com/huggingface/transformers/blob/main/.circleci/config.yml#L538. It installs the extra dependencies these tokenizers need.
Thanks for the tip! Done in 3cafcb2.
```diff
@@ -549,7 +549,7 @@ jobs:
       - v0.4-custom_tokenizers-{{ checksum "setup.py" }}
       - v0.4-{{ checksum "setup.py" }}
     - run: pip install --upgrade pip
-    - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy]
+    - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]
```
This adds `rjieba` to the custom tokenizer tests on CircleCI.
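For context, optional dependencies like this usually flow through setuptools extras. Here is a hedged sketch of how a `rjieba` extra can feed a `testing` extra (names and dependency lists are illustrative, not transformers' actual `setup.py`):

```python
# Sketch of setuptools-style extras wiring (illustrative only): declaring
# rjieba as its own extra and folding it into the testing extra is what
# lets `pip install .[testing]` (or the CircleCI pip install line above,
# which names rjieba directly) pull the package in.
extras = {}
extras["rjieba"] = ["rjieba"]
extras["testing"] = ["pytest", "pytest-xdist"] + extras["rjieba"]
```

In a real `setup.py`, the `extras` dict would be passed as `extras_require=extras` to `setuptools.setup(...)`.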
Hey @patrickvonplaten @LysandreJik, I think this PR is ready for a final pass :) The failing test is unrelated to the PR itself (a failing Pegasus generate test).
You'll need to rebase for the move of the test files. Otherwise LGTM!
* Skip RoFormer ONNX test if rjieba not installed
* Update deps table
* Skip RoFormer serialization test
* Fix RoFormer vocab
* Add rjieba to CircleCI
What does this PR do?

This PR adds the `@require_rjieba` decorator to the slow ONNX tests to deal with a missing-dependency error in our daily CI runs. I wasn't sure if `rjieba` should actually be installed in the GitHub workflow, but it doesn't seem to be the case for the RoFormer tests, so I omitted that for now.

Edit: I've added `rjieba` to the `"tests"` extras and also tested that the slow ONNX test passes when this dep is installed.
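`@require_rjieba` follows the require-dependency pattern common in transformers' test utilities. A hedged, self-contained sketch of that pattern (the real implementation lives in `testing_utils` and may differ; the test function below is hypothetical):

```python
import importlib.util
import unittest

def require_rjieba(test_case):
    # Skip the decorated test (or whole TestCase) unless rjieba is importable.
    if importlib.util.find_spec("rjieba") is None:
        return unittest.skip("test requires rjieba")(test_case)
    return test_case

@require_rjieba
def test_roformer_onnx_export():
    ...  # would exercise the ONNX export only when rjieba is available
```

When `rjieba` is absent, `unittest.skip` marks the test so runners report it as skipped instead of erroring at import time.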