
FastPitch: Parallel text-to-speech with pitch prediction.

Łańcucki, Adrian

In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6588-6592. IEEE, 2021 [Arxiv].

What's Unique: FastPitch is a fully parallel text-to-speech model. It is based on FastSpeech and is conditioned on fundamental frequency contours.

How it works

  • It is inspired by FastSpeech and has two feed-forward Transformer (FFT) blocks: one operates at the resolution of the input symbols, the other at the resolution of the output frames.
  • Like FastSpeech, it uses a duration predictor, and it relies on a trained Tacotron 2 model for the durations.
  • It does not require the knowledge distillation approach for mel-spectrogram prediction that FastSpeech uses.
  • Instead, it has a pitch prediction module that is trained along with the rest of the model, with the ground-truth pitch derived for each input symbol.
  • It is similar to the FastSpeech 2 model, except that FastSpeech 2 predicts pitch not for each input symbol but for each spectrogram frame, which makes it a bit more costly.
  • The architecture diagram is as follows:

Source: Author

  • Ground truth for the pitch prediction per input symbol:

Source: Author
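
One plausible way to obtain a single ground-truth pitch value per input symbol is to average the frame-level F0 over the frames assigned to that symbol. The sketch below assumes this averaging scheme, assumes that unvoiced frames are marked with 0 and are skipped, and uses a hypothetical helper name `average_pitch_per_symbol`; it is an illustration, not the author's implementation.

```python
import numpy as np

def average_pitch_per_symbol(f0, durations):
    """Average frame-level F0 over the frames assigned to each input symbol.

    f0        : frame-level pitch values, one per spectrogram frame
    durations : number of frames assigned to each input symbol
    Assumption: a value of 0 marks an unvoiced frame and is ignored.
    """
    f0 = np.asarray(f0, dtype=float)
    durations = np.asarray(durations, dtype=int)
    if durations.sum() != len(f0):
        raise ValueError("durations must sum to the number of frames")

    pitch = np.zeros(len(durations))
    start = 0
    for i, d in enumerate(durations):
        segment = f0[start:start + d]
        voiced = segment[segment > 0]
        # A symbol with no voiced frames keeps a pitch of 0.
        pitch[i] = voiced.mean() if voiced.size else 0.0
        start += d
    return pitch

# Toy example: 7 frames spread over 3 input symbols.
f0 = [100.0, 110.0, 0.0, 0.0, 200.0, 220.0, 210.0]
durations = [2, 2, 3]
pitch_per_symbol = average_pitch_per_symbol(f0, durations)
print(pitch_per_symbol)
```

In this toy input the second symbol covers only unvoiced frames, so its pitch stays at 0.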

  • The predicted pitch is embedded and added to the hidden representation from the first FFT block. The result is then upsampled according to the duration predicted for each input symbol.

$$
\begin{aligned}
\hat{\boldsymbol{d}} &= \operatorname{DurationPredictor}(\boldsymbol{h}), \quad \hat{\boldsymbol{p}} = \operatorname{PitchPredictor}(\boldsymbol{h}) \\
\boldsymbol{g} &= \boldsymbol{h} + \operatorname{PitchEmbedding}(\boldsymbol{p}) \\
\hat{\boldsymbol{y}} &= \operatorname{FFTr}\left([\underbrace{g_{1}, \ldots, g_{1}}_{d_{1}}, \ldots, \underbrace{g_{n}, \ldots, g_{n}}_{d_{n}}]\right)
\end{aligned}
$$
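
The two middle steps of the equations above (adding the pitch embedding, then repeating each symbol's vector according to its duration) can be sketched with NumPy. This is a minimal illustration under stated assumptions: the pitch embedding is replaced by a hypothetical per-symbol linear projection `w_pitch`, the hidden states are random toy values, and the helper `length_regulate` is a name made up here.

```python
import numpy as np

def length_regulate(g, durations):
    """Repeat the i-th row of g durations[i] times along the time axis."""
    return np.repeat(g, durations, axis=0)

rng = np.random.default_rng(0)
n_symbols, hidden = 3, 4

# h: hidden representation from the first FFT block (toy values).
h = rng.normal(size=(n_symbols, hidden))

# p: one pitch value per input symbol.
p = np.array([0.5, -1.0, 0.2])

# Hypothetical stand-in for PitchEmbedding: project each scalar pitch
# value to the hidden dimension with a single weight vector.
w_pitch = rng.normal(size=(hidden,))
g = h + p[:, None] * w_pitch

# d: number of output frames for each input symbol.
d = np.array([2, 1, 3])
upsampled = length_regulate(g, d)

print(upsampled.shape)  # one row per output frame
```

The upsampled sequence is what the second FFT block would consume to produce the mel-spectrogram frames.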

  • Inference is fast: the paper reports a real-time factor of over 900× for mel-spectrogram synthesis of a typical utterance.