Segment timestamps are buggy in BatchedInferencePipeline #919

Closed · kalradivyanshu opened this issue Jul 20, 2024 · 6 comments · Fixed by #921
kalradivyanshu commented Jul 20, 2024

In the recent code from #856, BatchedInferencePipeline sometimes produces wrong segment timestamps.

I recorded an audio clip: https://drive.google.com/file/d/1cbDbiXi12SIsd0hIDfs61VdtgI78Fg_p/view?usp=sharing

and for this clip BatchedInferencePipeline gives Segment(id=1, seek=2307, start=18.71, end=23.07..., even though the clip is only 3 seconds long.

Here is a Colab notebook recreating this issue; just upload batch.wav from the link above: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing
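
For reference, a minimal setup sketch that the snippets below assume (the model size, device, and compute type are my own choices, not taken from the issue):

# Setup sketch: model size, device, and compute_type are assumptions.
from faster_whisper import WhisperModel, BatchedInferencePipeline, decode_audio

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# decode_audio resamples batch.wav to the 16 kHz mono float32 array both calls expect
arr = decode_audio("batch.wav")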

segments, info = batched_model.transcribe(arr, word_timestamps=True, batch_size = 1)
print(list(segments))

output:

[Segment(id=1, seek=2307, start=18.71, end=23.07, text=' Hey Michael, how are you?', tokens=[14690, 3899, 11, 703, 389, 345, 30], avg_logprob=-0.21215820871293545, compression_ratio=0.7647058823529411, no_speech_prob=0.08087158203125, words=[Word(start=18.71, end=20.11, word=' Hey', probability=0.71240234375), Word(start=20.11, end=20.11, word=' Michael,', probability=0.732421875), Word(start=22.71, end=22.71, word=' how', probability=0.98046875), Word(start=22.71, end=22.71, word=' are', probability=0.9990234375), Word(start=22.71, end=23.07, word=' you?', probability=0.99853515625)], temperature=1.0)]
segments, info = model.transcribe(arr, word_timestamps=True)
print(list(segments))

correct output:

[Segment(id=1, seek=330, start=0.9999999999999996, end=2.5, text=' Hey Michael, how are you?', tokens=[50413, 14690, 3899, 11, 703, 389, 345, 30, 50513], avg_logprob=-0.568749976158142, compression_ratio=0.7575757575757576, no_speech_prob=0.078125, words=[Word(start=0.9999999999999996, end=1.4, word=' Hey', probability=0.7177734375), Word(start=1.4, end=1.6, word=' Michael,', probability=0.68896484375), Word(start=2.04, end=2.14, word=' how', probability=0.98046875), Word(start=2.14, end=2.32, word=' are', probability=0.998046875), Word(start=2.32, end=2.5, word=' you?', probability=0.99755859375)], temperature=0.0)]

Thank you for all your work!


kalradivyanshu commented Jul 20, 2024

I did some digging around; I think the word timestamps are the problem.

In these lines:
https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py#L1794-L1814

The last word's end timestamp is used as the segment's end timestamp, which points to the error being in the find_alignment function:

def find_alignment(

This is confirmed by the fact that if I don't pass word_timestamps=True, the segment timestamps are correct.
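
To illustrate the behaviour described above, here is a simplified sketch (not the actual faster-whisper code; the function name is hypothetical) of how a misaligned last word ends up shifting the segment end:

# Simplified sketch of the behaviour described above, not the actual
# faster-whisper implementation; segment_end_from_words is a hypothetical name.
def segment_end_from_words(decoded_segment_end, words):
    if words:
        # With word_timestamps=True, the last aligned word's end overrides the
        # segment end, so a bad alignment (e.g. end=23.07 on a 3 s clip)
        # propagates to the segment.
        return words[-1]["end"]
    # Without word timestamps, the decoder's own segment end is kept, which is
    # why the timestamps look correct when word_timestamps=True is not passed.
    return decoded_segment_end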


Jiltseb commented Jul 21, 2024

@MahmoudAshraf97

MahmoudAshraf97 (Contributor) commented

should be fixed in #920

kalradivyanshu (Author) commented

Hey @MahmoudAshraf97, thank you for your quick response and fix!

I have been able to recreate it with your code, but only for a specific case:
audio: https://drive.google.com/file/d/1oK0x7OF_JrfWa7ot1-bkTSP1m_tP_vEk/view?usp=sharing

colab: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing

In this case it works if I use batched_model.transcribe's built-in VAD, but if I pass in VAD segments explicitly, [{'start': 0.0, 'end': 30.0, 'segments': [(0.832, 7.024000000000001)]}], it breaks:

[Segment(id=1, seek=2998, start=0.0, end=29.98, text=" Good morning, Hank, it's Tuesday. You know how they're those videos that are like so-and-so answers the web's most searched questions about them, and", tokens=[4599, 3329, 11, 24386, 11, 340, 338, 3431, 13, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 11, 290], avg_logprob=-0.22146267195542654, compression_ratio=1.2295081967213115, no_speech_prob=0.03997802734375, words=[Word(start=0.0, end=0.1, word=' Good', probability=0.87646484375), Word(start=0.1, end=0.3, word=' morning,', probability=0.91845703125), Word(start=0.34, end=0.46, word=' Hank,', probability=0.409423828125), Word(start=0.46, end=0.62, word=" it's", probability=0.983154296875), Word(start=0.62, end=0.74, word=' Tuesday.', probability=0.99755859375), Word(start=1.02, end=1.02, word=' You', probability=0.97119140625), Word(start=1.02, end=1.1, word=' know', probability=0.99658203125), Word(start=1.1, end=1.22, word=' how', probability=0.87548828125), Word(start=1.22, end=1.4, word=" they're", probability=0.70458984375), Word(start=1.4, end=1.54, word=' those', probability=0.943359375), Word(start=1.54, end=1.86, word=' videos', probability=0.99560546875), Word(start=1.86, end=2.1, word=' that', probability=0.974609375), Word(start=2.1, end=2.22, word=' are', probability=0.9951171875), Word(start=2.22, end=2.38, word=' like', probability=0.66796875), Word(start=2.38, end=2.68, word=' so', probability=0.7060546875), Word(start=2.68, end=2.82, word='-and', probability=0.852294921875), Word(start=2.82, end=29.98, word='-so', probability=0.996337890625), Word(start=29.98, end=29.98, word=' answers', probability=0.98046875), Word(start=29.98, end=29.98, word=' the', probability=0.962890625), Word(start=29.98, end=29.98, word=" web's", probability=0.869140625), Word(start=29.98, end=29.98, word=' most', probability=0.9443359375), Word(start=29.98, end=29.98, word=' searched', probability=0.9453125), Word(start=29.98, end=29.98, word=' questions', probability=0.99365234375), Word(start=29.98, end=29.98, word=' about', probability=0.9990234375), Word(start=29.98, end=29.98, word=' them,', probability=0.9970703125), Word(start=29.98, end=29.98, word=' and', probability=0.9873046875)], temperature=1.0)]

However, if I pass the same segments into the normal transcribe call using clip_timestamps:

segments, info = model.transcribe(arr, word_timestamps=True, clip_timestamps = [0.832, 7.024000000000001], vad_filter=False)

That works correctly:

[Segment(id=1, seek=702, start=0.83, end=5.93, text=" You know how they're those videos that are like so-and-so answers the web's most searched questions about them and", tokens=[50363, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 290, 50619], avg_logprob=-0.22739954824958528, compression_ratio=1.1875, no_speech_prob=0.02056884765625, words=[Word(start=0.83, end=0.99, word=' You', probability=0.37646484375), Word(start=0.99, end=1.11, word=' know', probability=0.99267578125), Word(start=1.11, end=1.23, word=' how', probability=0.81787109375), Word(start=1.23, end=1.37, word=" they're", probability=0.66845703125), Word(start=1.37, end=1.55, word=' those', probability=0.8408203125), Word(start=1.55, end=1.87, word=' videos', probability=0.990234375), Word(start=1.87, end=2.11, word=' that', probability=0.96337890625), Word(start=2.11, end=2.21, word=' are', probability=0.9931640625), Word(start=2.21, end=2.37, word=' like', probability=0.8662109375), Word(start=2.37, end=2.69, word=' so', probability=0.7451171875), Word(start=2.69, end=2.83, word='-and', probability=0.7646484375), Word(start=2.83, end=2.99, word='-so', probability=0.99755859375), Word(start=2.99, end=3.35, word=' answers', probability=0.96826171875), Word(start=3.35, end=3.67, word=' the', probability=0.95703125), Word(start=3.67, end=4.07, word=" web's", probability=0.839111328125), Word(start=4.07, end=4.25, word=' most', probability=0.970703125), Word(start=4.25, end=4.51, word=' searched', probability=0.943359375), Word(start=4.51, end=4.87, word=' questions', probability=0.99462890625), Word(start=4.87, end=5.47, word=' about', probability=0.9970703125), Word(start=5.47, end=5.77, word=' them', probability=0.99755859375), Word(start=5.77, end=5.93, word=' and', probability=0.55078125)], temperature=0.0)]

I think it's some edge condition :/

MahmoudAshraf97 (Contributor) commented

That's because the input is not the same: the end of a VAD segment should never be greater than the end of its last subsegment. In your case the end is 30.0 while the subsegment end is 7.024, so this is equivalent to
clip_timestamps = [0.0, 30.0]
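
In other words, the outer start/end should match the subsegments. A sketch of the corrected input (the vad_segments argument name for the batched transcribe call is assumed here):

# Corrected VAD input sketch: the outer end matches the last subsegment end,
# so only 0.832-7.024 s is decoded instead of the full 0-30 s window.
vad_segments = [{'start': 0.832, 'end': 7.024, 'segments': [(0.832, 7.024)]}]
segments, info = batched_model.transcribe(
    arr, word_timestamps=True, batch_size=1, vad_segments=vad_segments
)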

kalradivyanshu (Author) commented

Oh ok, my bad, I fixed that, and it is working now!
