Wav2 vec2 phoneme ctc tokenizer optimisation #16817

ArthurZucker · 2022-04-18T13:23:29Z

What does this PR do?

This is my first PR!
The Wav2Vec2PhonemeCTCTokenizer is slow when its argument do_phonemize is set to True, because it re-initialises the phonemizer backend on every call. This PR addresses that by storing the backend as an attribute so that it is created only once.

The documentation also had an H4 heading containing a link that did not render. The `####` Markdown heading was replaced with `<h4></h4>` tags.

All existing tests pass; no additional tests were added. In a runtime experiment phonemizing the entire 'tr' (Turkish) subset of the Common Voice dataset, the change gives a roughly 10x speed-up.
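A rough way to check this kind of speed-up locally is to time the two strategies side by side. The sketch below does not reproduce the Common Voice experiment: `CostlyBackend`, its artificial `SETUP_COST_SECONDS` delay, and the helper names are all hypothetical stand-ins chosen for illustration, and the measured ratio depends entirely on that assumed setup cost.

```python
import time
from typing import Callable, List

# Assumed, artificial setup cost; a real backend's cost will differ.
SETUP_COST_SECONDS = 0.002


class CostlyBackend:
    """Hypothetical stand-in for a backend that is expensive to construct."""

    def __init__(self) -> None:
        time.sleep(SETUP_COST_SECONDS)  # simulate loading the backend

    def phonemize(self, text: str) -> str:
        # Toy behaviour only; not real phonemization.
        return text.lower()


def phonemize_reinit(sentences: List[str]) -> List[str]:
    """Build a new backend for every sentence (the slow pattern)."""
    return [CostlyBackend().phonemize(s) for s in sentences]


def phonemize_cached(sentences: List[str]) -> List[str]:
    """Build the backend once and reuse it for every sentence."""
    backend = CostlyBackend()
    return [backend.phonemize(s) for s in sentences]


def elapsed(fn: Callable[[List[str]], List[str]], sentences: List[str]) -> float:
    """Return the wall-clock seconds taken by fn(sentences)."""
    start = time.perf_counter()
    fn(sentences)
    return time.perf_counter() - start
```

Calling `elapsed(phonemize_reinit, sentences)` and `elapsed(phonemize_cached, sentences)` on the same list shows how the per-call setup cost accumulates with the number of sentences.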

Models:

Wav2Vec2PhonemeCTCTokenizer: @patrickvonplaten, @LysandreJik

Documentation: @sgugger

Markdown references in headings such as '####' don't render well. Replaced them with <h4>...<a></a></h4> banners.

The backend should only be initialized once; otherwise it is reloaded on every call. Added an `init_backend` function that initializes a backend attribute. Phonemize re-uses self.backend. Should give ~10 times faster phonemization.
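The pattern described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `StandInBackend` and `SketchTokenizer` are hypothetical names, the stand-in backend only mimics an expensive-to-build object, and rebuilding the backend when a different language is requested is an assumption about how such caching would usually be handled.

```python
from typing import List, Optional


class StandInBackend:
    """Hypothetical stand-in for an expensive-to-build phonemizer backend."""

    instances_created = 0  # counts constructions so that reuse can be observed

    def __init__(self, language: str) -> None:
        StandInBackend.instances_created += 1
        self.language = language

    def phonemize(self, texts: List[str]) -> List[str]:
        # Toy behaviour only: lower-case and space-separate the characters.
        return [" ".join(text.lower()) for text in texts]


class SketchTokenizer:
    """Hypothetical tokenizer that builds its backend once and reuses it."""

    def __init__(self, language: str = "en-us") -> None:
        self.language = language
        self.init_backend(language)

    def init_backend(self, language: str) -> None:
        # The only place where a backend is constructed.
        self.backend = StandInBackend(language)

    def phonemize(self, text: str, language: Optional[str] = None) -> str:
        # Assumption: rebuild only when a different language is requested.
        if language is not None and language != self.language:
            self.init_backend(language)
            self.language = language
        # Every other call reuses self.backend.
        return self.backend.phonemize([text])[0]
```

With this structure, repeated `phonemize` calls in the same language construct the backend only once, which is where the saving comes from.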

HuggingFaceDocBuilderDev · 2022-04-18T13:37:13Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Congrats on your first PR!
LGTM with a few nits, but let's wait for @patrickvonplaten approval before merging.

src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py

CONTRIBUTING.md

patrickvonplaten

Great! Thanks for fixing this :-)

Co-authored-by: Sylvain Gugger <[email protected]>

* Solved href rendering issue in heading

  Markdown references in headings such as '####' don't render well. Replaced them with <h4>...<a></a></h4> banners.

* PhonemeTokenizer optimization using phonemizer lib

  The backend should only be initialized once; otherwise it is reloaded on every call. Added an `init_backend` function that initializes a backend attribute. Phonemize re-uses self.backend. Should give ~10 times faster phonemization.

* formatted file with make style

* Documentation suggestion

  Co-authored-by: Sylvain Gugger <[email protected]>

* Update /tokenization_wav2vec2_phoneme.py based on PR suggestion

  Co-authored-by: Sylvain Gugger <[email protected]>

* Update CONTRIBUTING.md

  Co-authored-by: Sylvain Gugger <[email protected]>

ArthurZucker added 3 commits April 13, 2022 16:04

Solved href rendering issue in heading

ec7f5aa

Markdown references in headings such as '####' don't render well. Replaced them with <h4>...<a></a></h4> banners.

PhonemeTokenizer optimization using phonemizer lib

c29ef89

The backend should only be initialized once; otherwise it is reloaded on every call. Added an `init_backend` function that initializes a backend attribute. Phonemize re-uses self.backend. Should give ~10 times faster phonemization.

formatted file with make style

0050f42

sgugger approved these changes Apr 18, 2022

patrickvonplaten approved these changes Apr 18, 2022

ArthurZucker and others added 3 commits April 19, 2022 00:50

Documentation suggestion

91dcab9

Co-authored-by: Sylvain Gugger <[email protected]>

Update /tokenization_wav2vec2_phoneme.py based on PR suggestion

ad58a53

Co-authored-by: Sylvain Gugger <[email protected]>

Update CONTRIBUTING.md

8381b7c

sgugger merged commit 6de4ee6 into huggingface:main Apr 19, 2022
