# Architecture

DeepSpeech is the name of the architecture used by Baidu in their end-to-end speech recognition software. In this project we try to replicate their network. Below is a brief high-level overview.

The architecture can be broken into two sections: a convolutional network and a recurrent network. Below is an explanation of each portion, along with tips on modifying the architecture.

## Conv

Baidu experiment with up to three convolutional layers, but for the DeepSpeech2 benchmark they chose two. The implementation in this project differs in that it follows the kernel sizes and strides found in the DeepSpeech2 paper; however, this is subject to change.

We also assume batch normalization between each layer, and that the non-linearity of all layers is ReLU (specifically a clipped ReLU with a ceiling of 20, as found in the architecture description of the benchmark).

Conv layer | Output feature maps | Kernel size | Stride
---------- | ------------------- | ----------- | ------
1          | 32                  | 41 x 11     | 2, 2
2          | 32                  | 21 x 11     | 2, 1
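
As a rough illustration, the layers in the table could be built with Torch7's nn package along the following lines. This is a minimal sketch rather than the project's exact model: the single-channel spectrogram input and the mapping of each kernel/stride pair onto nn.SpatialConvolution's width/height arguments are assumptions here.

```lua
require 'nn'

-- Sketch of the two-layer convolutional front-end.
-- Input assumed to be (batch, 1, freq, time); kernel sizes follow the table.
local conv = nn.Sequential()
-- layer 1: 1 -> 32 feature maps, 41 x 11 kernel, stride 2, 2
conv:add(nn.SpatialConvolution(1, 32, 11, 41, 2, 2))
conv:add(nn.SpatialBatchNormalization(32))
conv:add(nn.Clamp(0, 20)) -- clipped ReLU with a ceiling of 20
-- layer 2: 32 -> 32 feature maps, 21 x 11 kernel, stride 2, 1
conv:add(nn.SpatialConvolution(32, 32, 11, 21, 2, 1))
conv:add(nn.SpatialBatchNormalization(32))
conv:add(nn.Clamp(0, 20))
```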

The idea behind the chosen kernel width of 41 was explained to me as follows:

41 = 20 + 1 + 20

This means there is a +/- 20 context window on either side of the current point.

## Recurrent

Baidu experiment with GRUs/LSTMs and vanilla RNNs. For the most part our implementation uses LSTMs; however, Baidu's systems tend towards GRUs or even vanilla RNNs. They contain what's called sequence-wise batch normalization, which is described below for a vanilla RNN (the bias for the input matrix multiplication has been removed, since batch norm cancels it out):

h_t = ReLU(B(W_i x_t) + R_i h_{t-1} + b_{R_i})
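
To make the B(W_i x_t) term concrete, here is a minimal Torch7 sketch of sequence-wise batch normalization, assuming input of shape (seqLength, batchSize, inputSize). The sizes are illustrative; the key point is the fold-into-batch trick, so that statistics are computed over every time step of the minibatch at once.

```lua
require 'nn'

local T, N, inputSize, hiddenSize = 100, 16, 256, 512 -- illustrative sizes

-- Computes B(W_i x_t) for a whole (T, N, inputSize) sequence by folding
-- the time dimension into the batch dimension, so batch-norm statistics
-- are accumulated over all T * N samples (sequence-wise batch norm).
local wx = nn.Sequential()
wx:add(nn.View(T * N, inputSize))               -- (T, N, in) -> (T*N, in)
wx:add(nn.Linear(inputSize, hiddenSize, false)) -- W_i x_t, with the bias removed
wx:add(nn.BatchNormalization(hiddenSize))       -- B(.), over all time steps at once
wx:add(nn.View(T, N, hiddenSize))               -- back to (T, N, out)
```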

The project implements batch normalization as a decorator around the cuDNN implementations of recurrent layers, described here.
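
In that spirit, a hypothetical decorator could feed the batch-normalized input transform into a cudnn.torch recurrent layer (whose internals cannot be modified), for example:

```lua
require 'cudnn'

-- Hypothetical wrapper: sequence-wise batch norm on the input-to-hidden
-- transform, followed by a single cuDNN recurrent layer.
-- cudnn.LSTM expects input of shape (seqLength, miniBatch, inputSize).
local rnn = nn.Sequential()
rnn:add(wx) -- the batch-norm module from the sketch above
rnn:add(cudnn.LSTM(hiddenSize, hiddenSize, 1))
```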

Due to the small size of AN4, increasing the size of the network didn't have a large effect on its accuracy. Larger datasets, especially when moving towards a production-level network, will need a larger number of recurrent layers, as well as larger hidden dimensions, to maintain an acceptable level of accuracy.

Baidu use around 7 recurrent layers, and in the DeepSpeech2 benchmark all recurrent layers have a hidden dimension of 1760. Be warned: this allocates a large amount of VRAM.

After communication with developers over at Baidu, the DeepSpeech architecture has moved away from bi-directional RNNs. It might be worth having a look at uni-directional RNNs for larger datasets and easier deployment. To modify the architecture, see here.
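
For scale, cudnn.torch can build a benchmark-sized stack in one call. The layer count and hidden size below come from the benchmark; the choice of LSTM cells and the bi-/uni-directional pairing is just an illustration:

```lua
require 'cudnn'

-- 7 stacked recurrent layers, 1760 hidden units each (allocates a lot of VRAM).
local birnn = cudnn.BLSTM(1760, 1760, 7) -- bi-directional, as currently implemented
local unirnn = cudnn.LSTM(1760, 1760, 7) -- uni-directional alternative
```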

## CTC Criterion

To train the system we use the CTCCriterion. This wraps warp-CTC, Baidu's fast implementation of CTC, which allows us to train on data that has not been aligned. This means there does not need to be the same number of audio time steps as there are characters (which would make training very difficult!).
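
To get a feel for the raw binding that the CTCCriterion wraps, below is a minimal example adapted from the warp-CTC Torch tutorial: one utterance, one time step, and an alphabet of five labels with 0 reserved for the CTC blank. Since the activations are uniform, the returned cost is -log(1/5).

```lua
require 'warp_ctc'

-- activations: (time steps) x (alphabet size), here 1 x 5, all uniform
local acts = torch.Tensor({{0, 0, 0, 0, 0}}):float()
local grads = torch.zeros(acts:size()):float() -- gradients are written in-place
local labels = {{1}} -- one utterance whose transcript is the single label 1
local sizes = {1}    -- number of time steps in each utterance

-- returns a table of per-utterance CTC costs; here log(5) ~= 1.609
print(cpu_ctc(acts, grads, labels, sizes))
```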

Check here to see more about warp-CTC or here for the implementation in Torch7.