Add truncation and padding setter APIs to tokenizers #1870

Merged


@siddvenk siddvenk commented Aug 4, 2022

Description

Adds truncation and padding support, in parity with the fast tokenizers provided by the Hugging Face tokenizers and transformers libraries.

This change adds the relevant Java APIs so that users can configure truncation and padding behavior after loading their tokenizer model.

The Rust APIs allow more customization than this implementation, the Python tokenizers implementation, or the Python transformers implementation exposes. If users need more customization than we provide, we can extend our API to allow it.
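
For readers unfamiliar with the two settings, here is a minimal sketch of what configuring truncation and padding after loading could look like. It is not the code in this PR: `SketchTokenizer` and its setters are hypothetical stand-ins, and it assumes the simplest strategy (truncate to a fixed maximum length, pad up to that same length with a pad id of 0). The real implementation delegates to the Rust tokenizer and may behave differently.

```java
import java.util.Arrays;

/** Hypothetical stand-in for an already loaded tokenizer; NOT the DJL class. */
final class SketchTokenizer {
    private boolean truncation; // off by default
    private boolean padding;    // off by default
    private int maxLength = 8;  // assumed default, for illustration only
    private int padId = 0;      // assumed pad token id

    // Setter-style configuration applied after the "model" is loaded.
    void setTruncation(boolean truncation) { this.truncation = truncation; }
    void setPadding(boolean padding) { this.padding = padding; }
    void setMaxLength(int maxLength) { this.maxLength = maxLength; }

    /** Applies the configured truncation and padding to a sequence of token ids. */
    long[] postProcess(long[] ids) {
        int kept = ids.length;
        if (truncation && kept > maxLength) {
            kept = maxLength; // drop tokens beyond maxLength
        }
        // When padding is on, short sequences grow to maxLength.
        int outLen = padding ? Math.max(kept, maxLength) : kept;
        long[] out = new long[outLen];
        System.arraycopy(ids, 0, out, 0, kept);
        Arrays.fill(out, kept, outLen, padId); // no-op when kept == outLen
        return out;
    }

    public static void main(String[] args) {
        SketchTokenizer tokenizer = new SketchTokenizer();
        tokenizer.setMaxLength(4);
        tokenizer.setTruncation(true);
        tokenizer.setPadding(true);
        System.out.println(Arrays.toString(tokenizer.postProcess(new long[] {5, 6, 7, 8, 9, 10})));
        System.out.println(Arrays.toString(tokenizer.postProcess(new long[] {5, 6})));
    }
}
```

With both settings enabled, every output has exactly `maxLength` entries, which is what makes batching sequences of different lengths possible.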

After this change, the next step is to add functionality to load truncation and padding behavior from the serving.properties file so that these configurations can be set during initialization. A subsequent PR will address that functionality.


codecov-commenter commented Aug 4, 2022

Codecov Report

Merging #1870 (8396f71) into master (bb5073f) will decrease coverage by 2.31%.
The diff coverage is 64.69%.

@@             Coverage Diff              @@
##             master    #1870      +/-   ##
============================================
- Coverage     72.08%   69.77%   -2.32%     
- Complexity     5126     5549     +423     
============================================
  Files           473      527      +54     
  Lines         21970    24501    +2531     
  Branches       2351     2667     +316     
============================================
+ Hits          15838    17095    +1257     
- Misses         4925     6095    +1170     
- Partials       1207     1311     +104     
| Impacted Files | Coverage Δ | |
|---|---|---|
| api/src/main/java/ai/djl/modality/cv/Image.java | 69.23% <ø> (-4.11%) | ⬇️ |
| ...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java | 76.00% <ø> (ø) | |
| ...rc/main/java/ai/djl/modality/cv/output/Joints.java | 71.42% <ø> (ø) | |
| .../main/java/ai/djl/modality/cv/output/Landmark.java | 100.00% <ø> (ø) | |
| ...main/java/ai/djl/modality/cv/output/Rectangle.java | 72.41% <0.00%> (ø) | |
| ...i/djl/modality/cv/translator/BigGANTranslator.java | 21.42% <0.00%> (-5.24%) | ⬇️ |
| ...odality/cv/translator/BigGANTranslatorFactory.java | 33.33% <0.00%> (+8.33%) | ⬆️ |
| ...nslator/InstanceSegmentationTranslatorFactory.java | 14.28% <0.00%> (-3.90%) | ⬇️ |
| .../cv/translator/SemanticSegmentationTranslator.java | 0.00% <0.00%> (ø) | |
| .../cv/translator/StyleTransferTranslatorFactory.java | 40.00% <ø> (ø) | |
... and 422 more


@siddvenk siddvenk merged commit 641d718 into deepjavalibrary:master Aug 4, 2022
@siddvenk siddvenk deleted the tokenizer-truncation-padding branch August 5, 2022 22:09