Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi

McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger.

In Interspeech, vol. 2017, pp. 498-502. 2017. [PDF]

What's Unique

This paper presents a tool for aligning an orthographic transcript to speech at both the word level and the phoneme level. It also remains trainable on new datasets.

Key Takeaways

MFA has been successfully applied to 29 languages.
Its training process involves the following steps:
- 40 iterations of monophone GMM training
- 15 iterations of monophone realignment
- 35 iterations of triphone training
- 15 iterations of triphone realignment
- 35 iterations of speaker-adapted triphone training
- 15 iterations of speaker-adapted realignment
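The schedule above can be written down as plain data, which makes the three-pass structure (monophone, triphone, speaker-adapted triphone) easier to see. The following is only a bookkeeping sketch: the `Stage` class, the stage labels, and the helper functions are hypothetical names introduced here, not part of MFA or Kaldi, and no acoustic model training is performed. The iteration counts are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training pass: a number of training and realignment iterations."""
    name: str
    train_iterations: int
    realign_iterations: int


# Iteration counts taken from the list above; names are illustrative labels.
SCHEDULE = [
    Stage("monophone", 40, 15),
    Stage("triphone", 35, 15),
    Stage("speaker-adapted triphone", 35, 15),
]


def run_schedule(schedule):
    """Return the ordered (stage, phase, count) steps; does no actual training."""
    steps = []
    for stage in schedule:
        steps.append((stage.name, "train", stage.train_iterations))
        steps.append((stage.name, "realign", stage.realign_iterations))
    return steps


def total_iterations(schedule):
    """Sum of all training and realignment iterations in the schedule."""
    return sum(s.train_iterations + s.realign_iterations for s in schedule)


if __name__ == "__main__":
    for name, phase, count in run_schedule(SCHEDULE):
        print(f"{name:<26} {phase:<8} {count:>3} iterations")
    print("total iterations:", total_iterations(SCHEDULE))
```

Each pass starts from the alignments produced by the previous one, which is why the stages are kept in an ordered list rather than a dictionary.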
It uses mel-frequency cepstral coefficients (MFCCs) as acoustic features.
Training on the 1000-hour LibriSpeech corpus took 80 hours.
It is evaluated against manual annotation using the Buckeye Corpus, which contains 20.7 hours of conversational speech from 40 speakers, with manual transcripts and boundaries at the phone and word levels.
The average difference between aligned and manually annotated boundaries is around 25 ms at the word level and around 17 ms at the phoneme level.
It uses the Kaldi toolkit for the underlying audio processing.
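
The boundary differences reported above can be illustrated with a small calculation. This sketch assumes the reported figure is a mean absolute difference between paired boundary times, which this summary does not state explicitly. The function name and the example timestamps are made up for illustration and are not data from the paper.

```python
def mean_abs_boundary_diff_ms(aligned, reference):
    """Mean absolute difference in milliseconds between paired boundaries.

    Both inputs are sequences of boundary times in seconds, where
    aligned[i] and reference[i] refer to the same word or phone boundary.
    """
    if len(aligned) != len(reference):
        raise ValueError("boundary lists must be the same length")
    if not aligned:
        raise ValueError("need at least one boundary")
    total_seconds = sum(abs(a - r) for a, r in zip(aligned, reference))
    return 1000.0 * total_seconds / len(aligned)


# Hypothetical boundary times in seconds (not taken from the Buckeye Corpus).
aligned_words = [0.50, 1.20, 2.05]
manual_words = [0.52, 1.17, 2.05]

if __name__ == "__main__":
    diff = mean_abs_boundary_diff_ms(aligned_words, manual_words)
    print(f"mean absolute word boundary difference: {diff:.1f} ms")
```

The same function applies to phone boundaries. Only the pairing of aligned and manual boundaries changes.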
