Segment timestamps are buggy in BatchedInferencePipeline #919

Closed · kalradivyanshu opened this issue Jul 20, 2024 · 6 comments · Fixed by #921
kalradivyanshu commented Jul 20, 2024

In the recent code from #856, BatchedInferencePipeline sometimes produces wrong segment timestamps.

I recorded an audio clip: https://drive.google.com/file/d/1cbDbiXi12SIsd0hIDfs61VdtgI78Fg_p/view?usp=sharing

and for this clip BatchedInferencePipeline gives Segment(id=1, seek=2307, start=18.71, end=23.07..., even though the clip is only 3 seconds long.

Here is a Colab notebook recreating this issue; just upload batch.wav from the link above: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing
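
For reference, a minimal setup sketch that the snippets below assume (the model size, device, and compute type are my own choices, not taken from the issue):

# Setup sketch: model size, device, and compute_type are assumptions.
from faster_whisper import WhisperModel, BatchedInferencePipeline, decode_audio

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# decode_audio resamples batch.wav to the 16 kHz mono float32 array both calls expect
arr = decode_audio("batch.wav")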

segments, info = batched_model.transcribe(arr, word_timestamps=True, batch_size = 1)
print(list(segments))

output:

[Segment(id=1, seek=2307, start=18.71, end=23.07, text=' Hey Michael, how are you?', tokens=[14690, 3899, 11, 703, 389, 345, 30], avg_logprob=-0.21215820871293545, compression_ratio=0.7647058823529411, no_speech_prob=0.08087158203125, words=[Word(start=18.71, end=20.11, word=' Hey', probability=0.71240234375), Word(start=20.11, end=20.11, word=' Michael,', probability=0.732421875), Word(start=22.71, end=22.71, word=' how', probability=0.98046875), Word(start=22.71, end=22.71, word=' are', probability=0.9990234375), Word(start=22.71, end=23.07, word=' you?', probability=0.99853515625)], temperature=1.0)]
segments, info = model.transcribe(arr, word_timestamps=True)
print(list(segments))

correct output:

[Segment(id=1, seek=330, start=0.9999999999999996, end=2.5, text=' Hey Michael, how are you?', tokens=[50413, 14690, 3899, 11, 703, 389, 345, 30, 50513], avg_logprob=-0.568749976158142, compression_ratio=0.7575757575757576, no_speech_prob=0.078125, words=[Word(start=0.9999999999999996, end=1.4, word=' Hey', probability=0.7177734375), Word(start=1.4, end=1.6, word=' Michael,', probability=0.68896484375), Word(start=2.04, end=2.14, word=' how', probability=0.98046875), Word(start=2.14, end=2.32, word=' are', probability=0.998046875), Word(start=2.32, end=2.5, word=' you?', probability=0.99755859375)], temperature=0.0)]

Thank you for all your work!


kalradivyanshu commented Jul 20, 2024

I did some digging around; I think the word timestamps are the problem.

In these lines:
https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py#L1794-L1814

The last word's end timestamp is used as the segment's end timestamp, which points to the error being in the find_alignment function:

def find_alignment(

This is confirmed by the fact that if I don't pass word_timestamps=True, the segment timestamps are correct.
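
To illustrate the behaviour described above, here is a simplified sketch (not the actual faster-whisper code; the function name is hypothetical) of how a misaligned last word ends up shifting the segment end:

# Simplified sketch of the behaviour described above, not the actual
# faster-whisper implementation; segment_end_from_words is a hypothetical name.
def segment_end_from_words(decoded_segment_end, words):
    if words:
        # With word_timestamps=True, the last aligned word's end overrides the
        # segment end, so a bad alignment (e.g. end=23.07 on a 3 s clip)
        # propagates to the segment.
        return words[-1]["end"]
    # Without word timestamps, the decoder's own segment end is kept, which is
    # why the timestamps look correct when word_timestamps=True is not passed.
    return decoded_segment_end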


Jiltseb commented Jul 21, 2024

@MahmoudAshraf97

MahmoudAshraf97 (Contributor) commented

should be fixed in #920

kalradivyanshu (Author) commented

Hey @MahmoudAshraf97, thank you for your quick response and fix!

I have been able to recreate it with your code, but only for a specific case:
audio: https://drive.google.com/file/d/1oK0x7OF_JrfWa7ot1-bkTSP1m_tP_vEk/view?usp=sharing

colab: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing

In this case it works if I use batched_model.transcribe's built-in VAD, but if I pass in VAD segments explicitly, [{'start': 0.0, 'end': 30.0, 'segments': [(0.832, 7.024000000000001)]}], it breaks:

[Segment(id=1, seek=2998, start=0.0, end=29.98, text=" Good morning, Hank, it's Tuesday. You know how they're those videos that are like so-and-so answers the web's most searched questions about them, and", tokens=[4599, 3329, 11, 24386, 11, 340, 338, 3431, 13, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 11, 290], avg_logprob=-0.22146267195542654, compression_ratio=1.2295081967213115, no_speech_prob=0.03997802734375, words=[Word(start=0.0, end=0.1, word=' Good', probability=0.87646484375), Word(start=0.1, end=0.3, word=' morning,', probability=0.91845703125), Word(start=0.34, end=0.46, word=' Hank,', probability=0.409423828125), Word(start=0.46, end=0.62, word=" it's", probability=0.983154296875), Word(start=0.62, end=0.74, word=' Tuesday.', probability=0.99755859375), Word(start=1.02, end=1.02, word=' You', probability=0.97119140625), Word(start=1.02, end=1.1, word=' know', probability=0.99658203125), Word(start=1.1, end=1.22, word=' how', probability=0.87548828125), Word(start=1.22, end=1.4, word=" they're", probability=0.70458984375), Word(start=1.4, end=1.54, word=' those', probability=0.943359375), Word(start=1.54, end=1.86, word=' videos', probability=0.99560546875), Word(start=1.86, end=2.1, word=' that', probability=0.974609375), Word(start=2.1, end=2.22, word=' are', probability=0.9951171875), Word(start=2.22, end=2.38, word=' like', probability=0.66796875), Word(start=2.38, end=2.68, word=' so', probability=0.7060546875), Word(start=2.68, end=2.82, word='-and', probability=0.852294921875), Word(start=2.82, end=29.98, word='-so', probability=0.996337890625), Word(start=29.98, end=29.98, word=' answers', probability=0.98046875), Word(start=29.98, end=29.98, word=' the', probability=0.962890625), Word(start=29.98, end=29.98, word=" web's", probability=0.869140625), Word(start=29.98, end=29.98, word=' most', probability=0.9443359375), Word(start=29.98, end=29.98, word=' searched', probability=0.9453125), Word(start=29.98, end=29.98, word=' questions', probability=0.99365234375), Word(start=29.98, end=29.98, word=' about', probability=0.9990234375), Word(start=29.98, end=29.98, word=' them,', probability=0.9970703125), Word(start=29.98, end=29.98, word=' and', probability=0.9873046875)], temperature=1.0)]

However, if I pass the same segments into the normal transcribe call using clip_timestamps:

segments, info = model.transcribe(arr, word_timestamps=True, clip_timestamps = [0.832, 7.024000000000001], vad_filter=False)

That works correctly:

[Segment(id=1, seek=702, start=0.83, end=5.93, text=" You know how they're those videos that are like so-and-so answers the web's most searched questions about them and", tokens=[50363, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 290, 50619], avg_logprob=-0.22739954824958528, compression_ratio=1.1875, no_speech_prob=0.02056884765625, words=[Word(start=0.83, end=0.99, word=' You', probability=0.37646484375), Word(start=0.99, end=1.11, word=' know', probability=0.99267578125), Word(start=1.11, end=1.23, word=' how', probability=0.81787109375), Word(start=1.23, end=1.37, word=" they're", probability=0.66845703125), Word(start=1.37, end=1.55, word=' those', probability=0.8408203125), Word(start=1.55, end=1.87, word=' videos', probability=0.990234375), Word(start=1.87, end=2.11, word=' that', probability=0.96337890625), Word(start=2.11, end=2.21, word=' are', probability=0.9931640625), Word(start=2.21, end=2.37, word=' like', probability=0.8662109375), Word(start=2.37, end=2.69, word=' so', probability=0.7451171875), Word(start=2.69, end=2.83, word='-and', probability=0.7646484375), Word(start=2.83, end=2.99, word='-so', probability=0.99755859375), Word(start=2.99, end=3.35, word=' answers', probability=0.96826171875), Word(start=3.35, end=3.67, word=' the', probability=0.95703125), Word(start=3.67, end=4.07, word=" web's", probability=0.839111328125), Word(start=4.07, end=4.25, word=' most', probability=0.970703125), Word(start=4.25, end=4.51, word=' searched', probability=0.943359375), Word(start=4.51, end=4.87, word=' questions', probability=0.99462890625), Word(start=4.87, end=5.47, word=' about', probability=0.9970703125), Word(start=5.47, end=5.77, word=' them', probability=0.99755859375), Word(start=5.77, end=5.93, word=' and', probability=0.55078125)], temperature=0.0)]

I think it's some edge condition :/

MahmoudAshraf97 (Contributor) commented

That's because the input is not the same: the end of a VAD segment should never be greater than the end of its last subsegment. In your case the end is 30.0 while the subsegment end is 7.024, so this is equivalent to
clip_timestamps = [0.0, 30.0]
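
In other words, the outer start/end should match the subsegments. A sketch of the corrected input (the vad_segments argument name for the batched transcribe call is assumed here):

# Corrected VAD input sketch: the outer end matches the last subsegment end,
# so only 0.832-7.024 s is decoded instead of the full 0-30 s window.
vad_segments = [{'start': 0.832, 'end': 7.024, 'segments': [(0.832, 7.024)]}]
segments, info = batched_model.transcribe(
    arr, word_timestamps=True, batch_size=1, vad_segments=vad_segments
)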

kalradivyanshu (Author) commented

Oh ok, my bad, I fixed that, and it is working now!
