Byte-Pair Encoding tokenizer #2533

Closed
larochef opened this issue Apr 13, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@larochef
Contributor

larochef commented Apr 13, 2023

Description

I would like to use a Hugging Face model that relies on a BPE tokenizer: flaubert/flaubert_base_uncased.
It can be found at: https://huggingface.co/flaubert/flaubert_base_uncased

The model doesn't ship a tokenizer.json, so the DJL Hugging Face tokenizers extension can't load it.

  • Is there any other way to build this kind of tokenizer with DJL?
  • Would it be possible to add a method to the Hugging Face wrapper that builds the tokenizer from the vocabulary.json and the merges.txt?
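
For context, here is a minimal sketch of how a vocabulary file and a merges file together define a BPE tokenizer. It is an illustration only, not the DJL or Hugging Face API: the function names `parse_merges`, `bpe_split`, and `encode` are hypothetical, the `</w>` end-of-word suffix and the `<unk>` token are assumptions, and pre-tokenization (lowercasing, splitting text into words) is left out.

```python
def parse_merges(lines):
    """Map each merge pair to its rank (earlier line = higher priority)."""
    ranks = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional "#version" header.
        if not line or line.startswith("#version"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        # Some merges files carry a third column (a count); ignore it.
        ranks[(parts[0], parts[1])] = len(ranks)
    return ranks


def bpe_split(word, ranks, end_of_word="</w>"):
    """Split one word into BPE symbols by applying merges in rank order."""
    if not word:
        return []
    # Start from single characters; the last one carries the suffix (assumed).
    symbols = list(word[:-1]) + [word[-1] + end_of_word]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def encode(word, ranks, vocab, unk="<unk>"):
    """Look up each BPE symbol in the vocabulary, falling back to unk."""
    return [vocab.get(s, vocab[unk]) for s in bpe_split(word, ranks)]
```

In practice `ranks` would come from reading merges.txt line by line and `vocab` from `json.load` on the vocabulary file. A tokenizer built by the library would also handle normalization and special tokens, which this sketch ignores.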

Will this change the current api? How?

Maybe it could just be a new method in HuggingFaceTokenizer that creates a tokenizer from the two files?

Who will benefit from this enhancement?

Anyone using a byte-pair encoding tokenizer.

References

@larochef larochef added the enhancement New feature or request label Apr 13, 2023
@frankfliu
Contributor

@larochef
Yes, we should be able to expose the BPE API in our Hugging Face tokenizer extension.

In the meantime, you should be able to use our sentencepiece extension, which has a BPE implementation.

@larochef
Contributor Author

I've had a look at sentencepiece, but it feels like I won't be able to easily reuse it as-is, since the training data seem to differ somewhat.

I've also had a look at the Rust binding, and it feels like it shouldn't be too hard to add this support. If that's OK, I'll gladly propose something for it, mirroring the way they do it, with a builder.

@larochef
Contributor Author

I have pushed an MR for it, and here are a few points:

  • I didn't manage to get the builder to work: I ran into some memory move issues that I didn't know how to fix properly. I guess it is somehow linked to what the cast_handle method does, which doesn't play nicely with the builder.
  • For some reason, I can't get the JNI link to work in unit tests. I must be missing something, but I don't really see what.

Tell me what you think of it, and let's see how we can get it fully working!
