I have not been successful in training an RNN for a speech-to-text problem using TensorFlow. I decided to use pure FFT features (i.e. a spectrogram) as training data to reproduce the results of the method described in Alex Graves and Navdeep Jaitly, 2014, and coded a 3-layer bidirectional RNN with 300 LSTM units in each layer. I would like to describe the steps I have followed, from pre-processing the audio signal to decoding the logits.
Pre-Processing:
Used the specgram function from matplotlib.mlab to segment each audio signal (in the time domain) into frames of 20 ms, with NFFT = (fs/1000 * 20) samples per frame, and to perform the windowing and FFT with an overlap of 7 ms.
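Roughly, the call looks like this (signal and fs are placeholder names for the raw audio array and its sample rate; this is a sketch, not my exact code):

```python
import numpy as np
from matplotlib import mlab

frame_len = int(fs / 1000 * 20)   # 20 ms frame -> NFFT samples
overlap   = int(fs / 1000 * 7)    # 7 ms overlap between frames

# specgram returns (spectrum, freqs, t); spectrum has shape [n_freq_bins, n_frames]
spectrum, freqs, t = mlab.specgram(signal, NFFT=frame_len, Fs=fs, noverlap=overlap)

# transpose so that time is the first axis: [n_frames, n_freq_bins]
features = spectrum.T.astype(np.float32)
```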
I initially tried computing the power spectrum ps = |FFT|^2 and the dB values as 10 * log10(ps), but the TensorFlow CTC loss function produces nan values, and the optimizer then apparently updates all the parameters to nan, so I did not proceed further with this. I should also mention that the spectrogram is not normalised, because normalising it only makes TensorFlow produce nan values for some reason. Could someone please clarify why this is happening? I have a feeling the gradients are vanishing. Any recommendations on what initialiser range to use?
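For reference, the computation I attempted was essentially the following; the small floor before the log is only an illustration of one way to avoid log10(0) = -inf (the 1e-10 value is an arbitrary assumption), since -inf would propagate to nan in the loss:

```python
import numpy as np

# frame: one 20 ms window of audio samples (placeholder name)
ps = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum |FFT|^2
# log10 of exact zeros gives -inf; flooring the power spectrum guards against that
ps_db = 10.0 * np.log10(np.maximum(ps, 1e-10))  # dB scale
```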
Since the audio files are of varying length, I have padded the frames of each batch up to max_time, as this is required to form a mini-batch of shape [max_time, batch, NFFT]. Since all the target transcriptions are in capital letters, I have only included A-Z, the blank space, and some punctuation in the list of classes (32 in total), which is used to transform each string target transcription into a SparseTensor.
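For concreteness, the padding and the conversion of transcriptions into a SparseTensor look roughly like this (the character set and helper names are placeholders, not my exact code):

```python
import numpy as np
import tensorflow as tf

# hypothetical class list: A-Z, space and a few punctuation marks (32 in total)
CHARS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ '.,?-")
CHAR_TO_ID = {c: i for i, c in enumerate(CHARS)}

def pad_batch(spectrograms):
    """Pad a list of [time, NFFT] arrays into one [max_time, batch, NFFT] array."""
    max_time = max(s.shape[0] for s in spectrograms)
    nfft = spectrograms[0].shape[1]
    batch = np.zeros((max_time, len(spectrograms), nfft), dtype=np.float32)
    seq_len = np.zeros(len(spectrograms), dtype=np.int32)
    for i, s in enumerate(spectrograms):
        batch[:s.shape[0], i, :] = s
        seq_len[i] = s.shape[0]   # unpadded length, used as sequence_length later
    return batch, seq_len

def to_sparse(transcripts):
    """Encode a batch of strings as the SparseTensor expected by tf.nn.ctc_loss."""
    indices, values = [], []
    for b, text in enumerate(transcripts):
        for t, ch in enumerate(text):
            indices.append([b, t])
            values.append(CHAR_TO_ID[ch])
    dense_shape = [len(transcripts), max(len(t) for t in transcripts)]
    return tf.SparseTensor(indices, values, dense_shape)
```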
RNN Config:
Forward and backward cells, each an LSTM cell with 300 units in each layer, using the peephole architecture, with the forget bias initially set to 0 to see the performance.
Bidirectional dynamic RNN with project_size set to hidden_size (500). The sequence-length tensor is assigned appropriate values, one per example in the batch, equal to that example's maximum time length.
Since tf.nn.bidirectional_dynamic_rnn does not include an output layer (sigmoid or softmax), I apply a linear (regression) layer outside it, whose weights have shape [hidden_size, n_chars].
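The projection I describe amounts to something like the following (the initialiser and the 2 * n_hidden input width, which assumes the forward and backward outputs are concatenated, are my assumptions; note that tf.nn.ctc_loss reserves one extra class index for the CTC blank, so the logits need n_chars + 1 outputs):

```python
n_classes = 32 + 1   # 32 characters + 1 CTC blank required by tf.nn.ctc_loss

# assuming the fw/bw outputs were concatenated, the input width is 2 * n_hidden
W = tf.get_variable("W_out", [2 * n_hidden, n_classes],
                    initializer=tf.truncated_normal_initializer(stddev=0.1))
b = tf.get_variable("b_out", [n_classes], initializer=tf.zeros_initializer())

# collapse time and batch, apply the affine layer, then restore the 3-D shape
flat = tf.reshape(rnn_out, [-1, 2 * n_hidden])
logits = tf.reshape(tf.matmul(flat, W) + b,
                    [tf.shape(rnn_out)[0], tf.shape(rnn_out)[1], n_classes])
```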
I have used the loss function tf.nn.ctc_loss, which returns huge values like 650 or 700 initially and comes down to a maximum of 500 after a few hundred epochs. Finally, a CTC beam search decoder is used to find the best path from the logits generated by the output softmax or sigmoid layer.
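Put together, the loss and decoding steps are roughly the following (targets is the SparseTensor built from the transcriptions, and the choice of optimiser here is only a placeholder; note that tf.nn.ctc_loss expects the raw logits rather than softmax outputs, because it applies the softmax internally):

```python
# targets: SparseTensor of label ids, logits: [max_time, batch, n_classes]
loss = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, seq_len, time_major=True))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)   # optimiser/lr are assumptions

# beam search decoding; decoded[0] is a SparseTensor of predicted label ids
decoded, log_prob = tf.nn.ctc_beam_search_decoder(logits, seq_len, beam_width=100)

# label error rate against the targets, useful for monitoring convergence
ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32), targets))
```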
Now, I do not understand where I am going wrong, but I am just not getting the desired transcription (i.e., the weights are not converging to yield the targeted results). Could someone please clarify why this is happening? I have tried to overfit the network on 100 audio clips, but to no avail; the predicted results are nowhere near the desired transcription.
Thank you for your time and support.