
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Merged 145 commits on Jul 18, 2024

Changes from 70 commits
fc54cb9
seed, multilingual and fixes
Jiltseb Jun 9, 2023
84d58fa
added languages in tokenizer
Jiltseb Jun 14, 2023
63bea66
multilingual fixes
Jiltseb Jun 21, 2023
b95d694
vocabulary extension fix for downloads
Jiltseb Jun 21, 2023
a8626bb
code fixes for multilingual
Jiltseb Jun 28, 2023
c2ca8d4
Squash long words at window and sentence boundaries
Jiltseb Jul 4, 2023
9edf960
added commits specifying changes to original package
Jiltseb Jul 26, 2023
d008650
seed, multilingual and fixes
Jiltseb Jun 9, 2023
2573982
added languages in tokenizer
Jiltseb Jun 14, 2023
8add326
multilingual fixes
Jiltseb Jun 21, 2023
afc3f5c
vocabulary extension fix for downloads
Jiltseb Jun 21, 2023
dd55c03
code fixes for multilingual
Jiltseb Jun 28, 2023
d34780e
Squash long words at window and sentence boundaries
Jiltseb Jul 4, 2023
9fab8d9
added commits specifying changes to original package
Jiltseb Jul 26, 2023
162fbf0
modifications based on review
Jiltseb Jul 28, 2023
ca6a2ba
removed LANGUAGES from tokenizer and added numpy requirements
Jiltseb Oct 6, 2023
0df6953
Merge remote-tracking branch 'upstream/master'
Jiltseb Oct 9, 2023
988c528
Merge local master to 'updated_js_v2.1'
Jiltseb Oct 9, 2023
443eb86
Merge pull request #1 from mobiusml/js_asr_v2.1_pr
Jiltseb Oct 9, 2023
6a51407
Update requirements.txt
Jiltseb Oct 9, 2023
4138e16
Merge pull request #2 from SYSTRAN/master
Jiltseb Dec 12, 2023
b906a98
changes to README.md
Jiltseb Dec 13, 2023
0464122
Added BatchedInferencePipeline
Jiltseb Dec 13, 2023
78b5cd7
Added language detection from multiple segments and batched inference…
Jiltseb Dec 13, 2023
f397e37
added additional packages
Jiltseb Dec 13, 2023
83895ac
changes to batched inference based on the review
Jiltseb Dec 20, 2023
e1c1699
change in silence detection
Jiltseb Dec 21, 2023
b516bc8
Merge pull request #3 from mobiusml/batched_asr
Jiltseb Dec 22, 2023
3477d86
Merge pull request #4 from SYSTRAN/master
Jiltseb Jan 22, 2024
95df9eb
added logic for torchaudio based feature extraction
Jiltseb Jan 23, 2024
0cc2d1d
added requirements
Jiltseb Jan 23, 2024
d6624ff
added feature extraction in README
Jiltseb Jan 23, 2024
fa69694
Merge pull request #5 from mobiusml/add_new_feat_extract
Jiltseb Jan 23, 2024
6698a9a
removing unwanted dataclasses and non-generator transcribe function, …
Jiltseb Mar 19, 2024
1b6376f
Merge remote-tracking branch systran/faster_whisper 'upstream/master'…
Jiltseb Mar 19, 2024
92867e3
uses same type annotation as faster_whisper for batched transcribe, c…
Jiltseb Mar 25, 2024
8452cf2
added jsons for dict conversion
Jiltseb Mar 25, 2024
4535963
made vad_segments as optional parameter, modified docstring
Jiltseb Mar 25, 2024
95671d2
made default batched asr options optional as this can be taken care d…
Jiltseb Mar 25, 2024
5fa21b8
Merge pull request #7 from mobiusml/fixes_and_update
Jiltseb Mar 26, 2024
b421086
Update requirements.txt
Jiltseb Mar 26, 2024
16d54e5
Update requirements.txt
Jiltseb Mar 26, 2024
827df36
Update requirements.txt
Jiltseb Mar 27, 2024
911c62d
Update requirements.txt
Jiltseb Mar 27, 2024
fcf8519
merging with systran fw
Jiltseb Apr 8, 2024
e288337
adding vad model and defaults for language detection
Jiltseb Apr 8, 2024
9c85222
adding utility functions for vad model
Jiltseb Apr 8, 2024
21f4640
add pyannote dependency
Jiltseb Apr 8, 2024
eff5e23
adding VAD model, tests and update README
Jiltseb Apr 9, 2024
caaa593
update requirements
Jiltseb Apr 10, 2024
538366b
Merge pull request #8 from mobiusml/fw_pr
Jiltseb Apr 11, 2024
c41e4f2
added 'use_vad_model' to better handle vad segments
Jiltseb Apr 12, 2024
0e8fa00
Update error message
Jiltseb Apr 12, 2024
0d6c62e
Merge pull request #9 from mobiusml/fw_pr
Jiltseb Apr 12, 2024
56d68a1
added gpu implementation for vad by default
Jiltseb Apr 28, 2024
2812d99
adding a vad_device, modifying vad_url
Jiltseb Apr 29, 2024
1cd3c60
adding get_device function
Jiltseb Apr 29, 2024
3f27636
Merge pull request #10 from mobiusml/fw_pr_compliance
Jiltseb Apr 29, 2024
93c327d
updating the fork
Jiltseb May 17, 2024
2152d11
Merge remote-tracking branch 'upstream/master' into pr_expt
Jiltseb May 22, 2024
10242fc
updated version, credits to whisper-x, model made optional
Jiltseb May 22, 2024
2dde3c9
Merge branch 'master' into fw_compliance
Jiltseb May 22, 2024
8fd2ec0
Merge pull request #11 from mobiusml/fw_compliance
Jiltseb May 24, 2024
0fd5003
added compatibility for python 3.8
Jiltseb May 24, 2024
9d70f0f
Reformatted the code
Jiltseb May 24, 2024
d263cbd
Merge pull request #12 from mobiusml/fw_compliance
Jiltseb May 24, 2024
c9e5f3b
making default vad_device same as asr model device
Jiltseb May 24, 2024
883be4d
added docstring
Jiltseb May 24, 2024
18bdaa8
added docstring
Jiltseb May 24, 2024
b10b8cb
Merge pull request #13 from mobiusml/fw_compliance
Jiltseb May 24, 2024
ce21fc7
Merge remote-tracking branch 'upstream/master'
Jiltseb Jun 11, 2024
afcc0f6
changes after review suggestions: remove redundant info, add vad mode…
Jiltseb Jun 11, 2024
0b63e22
modified timings for edge padding
Jiltseb Jun 12, 2024
e3dc61d
adding word_timestamps fir batched version
Jiltseb Jun 12, 2024
c694174
remove the input dictionary in place modification
Jiltseb Jun 13, 2024
a0d3891
adding model file
Jiltseb Jun 13, 2024
3c22842
Merge pull request #14 from mobiusml/fw_changes
Jiltseb Jun 13, 2024
d30b377
removing clip_timestamps and redundant info, minor typos
Jiltseb Jun 17, 2024
9937ab7
Merge pull request #15 from mobiusml/fw_changes
Jiltseb Jun 17, 2024
5c3e6f2
test scripts for word level timestamps, audios less than chunk_length…
Jiltseb Jun 18, 2024
d1f4a7e
added code validation
Jiltseb Jun 18, 2024
46310af
Merge pull request #16 from mobiusml/fw_changes
Jiltseb Jun 18, 2024
7498451
Update MANIFEST.in to include pyannote asset
hargunmujral Jun 20, 2024
307de38
Merge pull request #17 from hargunmujral/patch-1
Jiltseb Jun 20, 2024
17e30a4
.
MahmoudAshraf97 Jun 20, 2024
46532fc
Merge branch 'mobiusml:master' into master
MahmoudAshraf97 Jun 20, 2024
ad2379b
remove tokenizer reinitialization
MahmoudAshraf97 Jun 20, 2024
abcbedd
remove the need for a separate `encode_batched` function
MahmoudAshraf97 Jun 21, 2024
f584a6c
fix flake8 error
MahmoudAshraf97 Jun 21, 2024
1bd1bf7
Added punctuation changes in word_timestamps, removed jsons requirement
Jiltseb Jun 21, 2024
ebf7b65
enable word timestamps using original functions
MahmoudAshraf97 Jun 21, 2024
7f84e34
* remove `PyAV` and use `torchaudio` instead, this fixes the memory l…
MahmoudAshraf97 Jun 22, 2024
b54d828
added back `np.ndarray` support for `transcribe`
MahmoudAshraf97 Jun 24, 2024
2c617c2
fix wrong padding scheme leading to very high WER
MahmoudAshraf97 Jun 24, 2024
99d61e0
remove `num_workers` argument from batched `transcribe`
MahmoudAshraf97 Jun 24, 2024
aef4b97
generalized word timestamps function
MahmoudAshraf97 Jun 24, 2024
5fc5fca
remove redundant parameters related to `num_workers`
MahmoudAshraf97 Jun 25, 2024
389da33
fix word timestamps for non-batched inference
MahmoudAshraf97 Jun 25, 2024
2b0a252
support `without_timestamps` in batched mode
MahmoudAshraf97 Jun 25, 2024
f03d8ca
adjust tests
MahmoudAshraf97 Jun 25, 2024
7c38429
fix typing hints for older python versions
MahmoudAshraf97 Jun 25, 2024
579da0e
correct timestamps
MahmoudAshraf97 Jun 26, 2024
8642f1d
use original `Segment` instead of `BatchedSegment`
MahmoudAshraf97 Jun 27, 2024
6e47bd3
* added `duration_after_vad`, `all_language_probs` to `info`
MahmoudAshraf97 Jun 27, 2024
537317f
formatting changes
MahmoudAshraf97 Jun 27, 2024
74db8be
.
MahmoudAshraf97 Jun 27, 2024
fcf0e82
remove `float16` conversion in feature extractor as it led to halluci…
MahmoudAshraf97 Jun 27, 2024
9f78b36
enable running benchmark from anywhere
MahmoudAshraf97 Jun 29, 2024
d95c7a6
review feature extraction implementation
MahmoudAshraf97 Jun 29, 2024
968057e
formatting fixes
MahmoudAshraf97 Jun 29, 2024
eff81f5
Merge pull request #18 from MahmoudAshraf97/master
Jiltseb Jul 1, 2024
71fca47
Merge remote-tracking branch 'origin/master' into final_changes
Jiltseb Jul 1, 2024
369f297
black tool reformats
Jiltseb Jul 1, 2024
248d517
Merge remote-tracking branch 'upstream/master' into final_changes
Jiltseb Jul 1, 2024
647c092
revert silero change to master
Jiltseb Jul 1, 2024
923c5d9
moving language_id functions to WhisperModel class and removing other…
Jiltseb Jul 1, 2024
70346ca
evaluate lang_detect to a false boolean if not found
Jiltseb Jul 1, 2024
3235640
review changes
MahmoudAshraf97 Jul 1, 2024
781c051
Merge branch 'mobiusml:master' into master
MahmoudAshraf97 Jul 1, 2024
aea77b1
Merge pull request #21 from MahmoudAshraf97/master
Jiltseb Jul 1, 2024
c26e4e2
Merge remote-tracking branch 'origin/master' into final_changes
Jiltseb Jul 1, 2024
059d849
rename detect_language to detect_langauge_function in WhisperModel
Jiltseb Jul 1, 2024
5c6f6b5
Merge pull request #20 from mobiusml/fw_changes
Jiltseb Jul 1, 2024
3a63df0
fix conflicts with systran master
MahmoudAshraf97 Jul 5, 2024
3271a4a
Merge pull request #23 from MahmoudAshraf97/master
Jiltseb Jul 5, 2024
e57b5ca
Merge remote-tracking branch 'systran_master/master'
MahmoudAshraf97 Jul 5, 2024
2fc6c50
.
MahmoudAshraf97 Jul 5, 2024
8bdbca0
rename `chunk_size` to `chunk_length` for consistency
MahmoudAshraf97 Jul 5, 2024
b94bd93
Merge branch 'master' into master
Jiltseb Jul 5, 2024
fec8c4e
Merge pull request #24 from MahmoudAshraf97/master
Jiltseb Jul 5, 2024
ad080cd
review comments
MahmoudAshraf97 Jul 5, 2024
aef5869
.
MahmoudAshraf97 Jul 5, 2024
9b39b73
fixing docstring
MahmoudAshraf97 Jul 5, 2024
1dcf0c9
Merge pull request #25 from MahmoudAshraf97/master
Jiltseb Jul 5, 2024
e988ac6
fix usage with english-only models
MahmoudAshraf97 Jul 6, 2024
b3c1ace
Merge pull request #26 from MahmoudAshraf97/master
Jiltseb Jul 8, 2024
c51b877
added licensing comments inthe doc and the code
Jiltseb Jul 10, 2024
7a90ab8
Merge pull request #27 from mobiusml/fw_changes
Jiltseb Jul 10, 2024
3fd6f7c
added formatting checks
Jiltseb Jul 10, 2024
6a87d85
Merge pull request #28 from mobiusml/fw_changes
Jiltseb Jul 10, 2024
4681caa
update license info
Jiltseb Jul 11, 2024
62bb5f0
Merge pull request #29 from mobiusml/fw_changes
Jiltseb Jul 11, 2024
bb6696b
.
MahmoudAshraf97 Oct 2, 2024
5e6a426
remove duplicate `detect_language` function
MahmoudAshraf97 Oct 2, 2024
3ffb18f
Merge pull request #22 from MahmoudAshraf97/master
Jiltseb Jul 2, 2024
31 changes: 30 additions & 1 deletion README.md
@@ -1,6 +1,6 @@
[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)

# Faster Whisper transcription with CTranslate2
# Mobius Faster Whisper transcription with CTranslate2

**faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.

@@ -166,6 +166,35 @@ for segment in segments:
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```

### Multi-segment language detection

To use the model directly for improved language detection, the following code snippet can be used:

```python
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
```
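For intuition, multi-segment detection aggregates language probabilities across several audio windows instead of trusting the first 30-second window alone. A minimal numpy sketch of that idea — the segment probabilities and language list below are invented for illustration and this is not the library's internal code:

```python
import numpy as np

# Hypothetical per-segment language probabilities from three audio windows.
# Rows: segments; columns: candidate languages.
languages = ["en", "de", "nl"]
segment_probs = np.array([
    [0.70, 0.20, 0.10],  # segment 1 leans English
    [0.40, 0.15, 0.45],  # segment 2 is ambiguous
    [0.65, 0.10, 0.25],  # segment 3 leans English
])

# Averaging across segments is more robust than using any single window,
# where a stretch of music or code-switching could flip the decision.
mean_probs = segment_probs.mean(axis=0)
detected = languages[int(np.argmax(mean_probs))]
print(detected, mean_probs.round(3))  # → en [0.583 0.15  0.267]
```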

### Batched faster-whisper


The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX), licensed under the BSD-4-Clause license. This product includes software developed by Max Bain. We modified that implementation and added kaldi-based feature extraction. It transcribes semantically meaningful audio chunks as batches, improving speed by up to 10-12x over the OpenAI implementation and 3-4x over the sequential faster-whisper version.

The following code snippet illustrates how to run batched inference on an example audio file. Please also refer to the test scripts for batched faster-whisper.

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
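The chunk-and-batch idea behind the speedup can be sketched in plain numpy: voice-activity segments are cut out of the waveform, padded to a fixed chunk length, and stacked so one forward pass handles many segments. The sample rate, segment boundaries, and chunk length below are invented for illustration and do not reflect the library's internals:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30
chunk_len = SAMPLE_RATE * CHUNK_SECONDS

# One minute of fake audio and some hypothetical VAD segments (start, end) in seconds.
audio = np.random.randn(60 * SAMPLE_RATE).astype(np.float32)
vad_segments = [(0.5, 12.0), (15.2, 29.8), (31.0, 55.5)]

def to_batch(audio, segments, chunk_len):
    """Cut VAD segments out of the waveform and zero-pad each to chunk_len samples."""
    batch = []
    for start, end in segments:
        chunk = audio[int(start * SAMPLE_RATE): int(end * SAMPLE_RATE)]
        padded = np.zeros(chunk_len, dtype=audio.dtype)
        padded[: len(chunk)] = chunk[:chunk_len]
        batch.append(padded)
    return np.stack(batch)  # shape: (num_segments, chunk_len)

batch = to_batch(audio, vad_segments, chunk_len)
print(batch.shape)  # → (3, 480000)
```

A real pipeline would feed this stacked batch through the encoder in one call, which is where the throughput gain over sequential chunk-by-chunk decoding comes from.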

### Faster Distil-Whisper

The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
3 changes: 2 additions & 1 deletion faster_whisper/__init__.py
@@ -1,12 +1,13 @@
from faster_whisper.audio import decode_audio
from faster_whisper.transcribe import WhisperModel
from faster_whisper.transcribe import BatchedInferencePipeline, WhisperModel
from faster_whisper.utils import available_models, download_model, format_timestamp
from faster_whisper.version import __version__

__all__ = [
    "available_models",
    "decode_audio",
    "WhisperModel",
    "BatchedInferencePipeline",
    "download_model",
    "format_timestamp",
    "__version__",
55 changes: 41 additions & 14 deletions faster_whisper/feature_extractor.py
@@ -1,4 +1,6 @@
import numpy as np
import torch
import torchaudio.compliance.kaldi as ta_kaldi


# Adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/feature_extraction_whisper.py # noqa: E501
@@ -21,6 +23,7 @@ def __init__(
        self.mel_filters = self.get_mel_filters(
            sampling_rate, n_fft, n_mels=feature_size
        )
        self.n_mels = feature_size

    def get_mel_filters(self, sr, n_fft, n_mels=128, dtype=np.float32):
        # Initialize the weights
@@ -142,29 +145,53 @@ def stft(self, frames, window):
            data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
        return data.T

    def __call__(self, waveform, padding=True, chunk_length=None):
    def __call__(self, waveform, enable_ta=False, padding=True, chunk_length=None):
        """
        Compute the log-Mel spectrogram of the provided audio. This gives results
        similar to whisper's original torch implementation, with 1e-5 tolerance.
        Additionally, a faster feature extraction option using kaldi fbank features
        is available if torchaudio is installed.
        """
        if enable_ta:
            waveform = waveform.astype(np.float32)

        if chunk_length is not None:
            self.n_samples = chunk_length * self.sampling_rate
            self.nb_max_frames = self.n_samples // self.hop_length

        if padding:
            waveform = np.pad(waveform, [(0, self.n_samples)])

        window = np.hanning(self.n_fft + 1)[:-1]

        frames = self.fram_wave(waveform)
        stft = self.stft(frames, window=window)
        magnitudes = np.abs(stft[:, :-1]) ** 2

        filters = self.mel_filters
        mel_spec = filters @ magnitudes

        log_spec = np.log10(np.clip(mel_spec, a_min=1e-10, a_max=None))
        log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
        log_spec = (log_spec + 4.0) / 4.0
        if enable_ta:
            audio = torch.from_numpy(waveform).unsqueeze(0)
            fbank = ta_kaldi.fbank(
                audio,
                sample_frequency=self.sampling_rate,
                window_type="hanning",
                num_mel_bins=self.n_mels,
            )
            log_spec = fbank.numpy().T.astype(np.float32)  # CTranslate2 does not accept float64

            # Normalize using AudioSet values as default mean and std for audio
            mean_val = -4.2677393
            std_val = 4.5689974
            log_spec = (log_spec - mean_val) / (std_val * 2)

        else:
            window = np.hanning(self.n_fft + 1)[:-1]

            frames = self.fram_wave(waveform)
            stft = self.stft(frames, window=window)
            magnitudes = np.abs(stft[:, :-1]) ** 2

            filters = self.mel_filters
            mel_spec = filters @ magnitudes

            log_spec = np.log10(np.clip(mel_spec, a_min=1e-10, a_max=None))
            log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
            log_spec = (log_spec + 4.0) / 4.0

        return log_spec
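The two branches in this diff normalize differently: the classic Whisper path clips, takes log10, clamps the dynamic range to 8 decades, and maps the result toward [-1, 1] via `(log_spec + 4.0) / 4.0`, while the kaldi path standardizes with fixed AudioSet statistics. A self-contained numpy sketch of both normalizations applied to a dummy mel spectrogram (the input values are arbitrary; this only illustrates the arithmetic):

```python
import numpy as np

# Dummy mel power spectrogram (80 mel bins x 100 frames, arbitrary values).
mel_spec = np.abs(np.random.randn(80, 100)).astype(np.float32) ** 2

# Classic Whisper normalization, as in the numpy branch of the diff.
log_spec = np.log10(np.clip(mel_spec, a_min=1e-10, a_max=None))
log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # clamp to 8 decades below the peak
whisper_features = (log_spec + 4.0) / 4.0

# AudioSet-style standardization, as in the kaldi (enable_ta) branch.
mean_val, std_val = -4.2677393, 4.5689974
kaldi_features = (log_spec - mean_val) / (std_val * 2)

# After clamping, the Whisper features span at most 8 / 4 = 2 units.
print(float(whisper_features.max() - whisper_features.min()))
```

The clamp-then-scale step guarantees a bounded input range for the model regardless of recording loudness, which is why the kaldi path needs its own fixed-statistics normalization to land in a comparable range.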