I have not been successful in training an RNN for a speech-to-text problem using TensorFlow. I decided to use pure FFT features (i.e. a spectrogram) as training data to reproduce the results of the method described in Alex Graves and Navdeep Jaitly, 2014, and coded a 3-layer bidirectional RNN with 300 LSTM units in each layer. I would like to describe the steps I have followed, from pre-processing the audio signal to decoding the logits.
Pre-Processing:
Used the specgram function from matplotlib.mlab to segment each audio signal (in the time domain) into frames of 20 ms, with NFFT = (fs/1000 * 20) samples per frame, and to perform the windowing and FFT with an overlap of 7 ms.
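Roughly, the call looks like this (signal and fs are placeholder names for the raw audio array and its sample rate; this is a sketch, not my exact code):

```python
import numpy as np
from matplotlib import mlab

frame_len = int(fs / 1000 * 20)   # 20 ms frame -> NFFT samples
overlap   = int(fs / 1000 * 7)    # 7 ms overlap between frames

# specgram returns (spectrum, freqs, t); spectrum has shape [n_freq_bins, n_frames]
spectrum, freqs, t = mlab.specgram(signal, NFFT=frame_len, Fs=fs, noverlap=overlap)

# transpose so that time is the first axis: [n_frames, n_freq_bins]
features = spectrum.T.astype(np.float32)
```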
I initially tried computing the power spectrum ps = |FFT|^2 and the dB values as 10 * log10(ps), but the TensorFlow CTC loss function produces nan values, and the optimizer then apparently updates all the parameters to nan, so I did not proceed further with this. I should also mention that the spectrogram is not normalised, because normalising it only makes TensorFlow produce nan values for some reason. Could someone please clarify why this is happening? I have a feeling the gradients are vanishing. Any recommendations on what initialiser range to use?
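For reference, the computation I attempted was essentially the following; the small floor before the log is only an illustration of one way to avoid log10(0) = -inf (the 1e-10 value is an arbitrary assumption), since -inf would propagate to nan in the loss:

```python
import numpy as np

# frame: one 20 ms window of audio samples (placeholder name)
ps = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum |FFT|^2
# log10 of exact zeros gives -inf; flooring the power spectrum guards against that
ps_db = 10.0 * np.log10(np.maximum(ps, 1e-10))  # dB scale
```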
Since the audio files are of varying length, I have padded the frames of each batch up to max_time, as this is required to form a mini-batch of shape [max_time, batch, NFFT]. Since all the target transcriptions are in capital letters, I have only included A-Z, the blank space, and some punctuation in the list of classes (32 in total), which is used to transform each string target transcription into a SparseTensor.
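For concreteness, the padding and the conversion of transcriptions into a SparseTensor look roughly like this (the character set and helper names are placeholders, not my exact code):

```python
import numpy as np
import tensorflow as tf

# hypothetical class list: A-Z, space and a few punctuation marks (32 in total)
CHARS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ '.,?-")
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}

def pad_batch(spectrograms):
    """Pad a list of [time, NFFT] arrays into one [max_time, batch, NFFT] array."""
    max_time = max(s.shape[0] for s in spectrograms)
    nfft = spectrograms[0].shape[1]
    batch = np.zeros((max_time, len(spectrograms), nfft), dtype=np.float32)
    seq_len = np.zeros(len(spectrograms), dtype=np.int32)
    for i, s in enumerate(spectrograms):
        batch[:s.shape[0], i, :] = s
        seq_len[i] = s.shape[0]   # unpadded length, used as sequence_length later
    return batch, seq_len

def to_sparse(transcripts):
    """Encode a batch of strings as the SparseTensor expected by tf.nn.ctc_loss."""
    indices, values = [], []
    for b, text in enumerate(transcripts):
        for t, ch in enumerate(text):
            indices.append([b, t])
            values.append(CHAR_TO_ID[ch])
    dense_shape = [len(transcripts), max(len(t) for t in transcripts)]
    return tf.SparseTensor(indices, values, dense_shape)
```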
RNN Config:
Forward and backward cells, each an LSTM cell with 300 units in each layer, using the peephole architecture, with the forget bias initially set to 0 to see the performance.
Bidirectional dynamic RNN with project_size set to hidden_size (500). The sequence-length tensor is assigned appropriate values, one per example in the batch, equal to that example's maximum time length.
Since tf.nn.bidirectional_dynamic_rnn does not include an output layer (sigmoid or softmax), I apply a linear (regression) layer outside it, whose weights have shape [hidden_size, n_chars].
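The projection I describe amounts to something like the following (the initialiser and the 2 * n_hidden input width, which assumes the forward and backward outputs are concatenated, are my assumptions; note that tf.nn.ctc_loss reserves one extra class index for the CTC blank, so the logits need n_chars + 1 outputs):

```python
n_classes = 32 + 1   # 32 characters + 1 CTC blank required by tf.nn.ctc_loss

# assuming the fw/bw outputs were concatenated, the input width is 2 * n_hidden
W = tf.get_variable("W_out", [2 * n_hidden, n_classes],
                    initializer=tf.truncated_normal_initializer(stddev=0.1))
b = tf.get_variable("b_out", [n_classes], initializer=tf.zeros_initializer())

# collapse time and batch, apply the affine layer, then restore the 3-D shape
flat = tf.reshape(rnn_out, [-1, 2 * n_hidden])
logits = tf.reshape(tf.matmul(flat, W) + b,
                    [tf.shape(rnn_out)[0], tf.shape(rnn_out)[1], n_classes])
```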
I have used the loss function tf.nn.ctc_loss, which returns huge values like 650 or 700 initially and comes down to a maximum of 500 after a few hundred epochs. Finally, a CTC beam search decoder is used to find the best path from the logits generated by the output softmax or sigmoid layer.
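Put together, the loss and decoding steps are roughly the following (targets is the SparseTensor built from the transcriptions, and the choice of optimiser here is only a placeholder; note that tf.nn.ctc_loss expects the raw logits rather than softmax outputs, because it applies the softmax internally):

```python
# targets: SparseTensor of label ids, logits: [max_time, batch, n_classes]
loss = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, seq_len, time_major=True))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)   # optimiser/lr are assumptions

# beam search decoding; decoded[0] is a SparseTensor of predicted label ids
decoded, log_prob = tf.nn.ctc_beam_search_decoder(logits, seq_len, beam_width=100)

# label error rate against the targets, useful for monitoring convergence
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32), targets))
```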
Now, I do not understand where I am going wrong, but I am just not getting the desired transcription (i.e., the weights are not converging to yield the targeted results). Could someone please clarify why this is happening? I have tried to overfit the network on 100 audio clips, but to no avail; the predicted results are nowhere near the desired transcription.
Thank you for your time and support.