Tokenizer.json compatibility with ai.djl.huggingface tokenizers broken for sentencepiece based models #31086
Comments
The older tokenizer.json file saved on the model hub imports fine. There is one newly introduced parameter.
Have they fixed it on HEAD? If so, I guess 0.28 will be out soon.
Created deepjavalibrary/djl#3141
Workaround: patch the tokenizer.json file to remove the key (vespa-engine/sample-apps#1421)
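The patching workaround can be sketched as a small script. The offending parameter is not named in this thread, so the key to remove is taken as an argument here; the recursive walk is an assumption that the parameter may appear in more than one section of tokenizer.json (e.g. both pre_tokenizer and decoder), not a statement about the actual file layout.

```python
import json

def remove_key(node, key):
    """Recursively delete every occurrence of `key` from a parsed
    tokenizer.json structure (nested dicts and lists)."""
    if isinstance(node, dict):
        node.pop(key, None)
        for value in node.values():
            remove_key(value, key)
    elif isinstance(node, list):
        for item in node:
            remove_key(item, key)

def patch_tokenizer_json(path, key):
    """Load tokenizer.json, strip `key` everywhere, and write it back."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    remove_key(config, key)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False)
```

After patching, the file should load again in the older djl-based tokenizer, since only the newly introduced parameter is removed and the rest of the structure is untouched.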
deepjavalibrary upgraded to 0.28 in #31216
At some point, the transformers library's save_pretrained changed the tokenizer.json it writes for sentencepiece based tokenizer models, making it incompatible with the djl tokenizer implementation we depend on. We use 0.27.0, which is the latest version.
Importing a tokenizer.json file produced by a recent transformers version will prevent the embedder from starting because the djl tokenizer doesn't understand the format.
For example, a config pointing to a tokenizer file exported from intfloat/multilingual-e5-small (using either optimum-cli or our export tooling) will fail with the following error:
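Since the failure only surfaces when the embedder starts, a pre-deployment check can catch an incompatible tokenizer.json earlier. This is a hypothetical sketch: the KNOWN_KEYS set below is an illustrative placeholder for whatever schema the older djl 0.27.0 loader actually accepts, not djl's real key list.

```python
import json

# Placeholder allow-list of pre_tokenizer keys the older loader is
# assumed to understand; replace with the real schema in practice.
KNOWN_KEYS = {"type", "replacement", "add_prefix_space", "str_rep"}

def unknown_pre_tokenizer_keys(tokenizer_json: str) -> set:
    """Return any keys in the pre_tokenizer section of a tokenizer.json
    document that are not in the allow-list; a non-empty result suggests
    the file was written by a newer transformers version."""
    config = json.loads(tokenizer_json)
    pre_tok = config.get("pre_tokenizer") or {}
    return set(pre_tok) - KNOWN_KEYS
```

Running such a check in the export pipeline would flag the file before deployment instead of failing at embedder startup.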