I am trying to program a Keras model for audio transcription using connectionist temporal classification. Using a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train on mapping the short-time Fourier transform of German sentences to their phonetic transcription.
My training data actually do have timing information, so I can use it to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%). There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the NULL class and added a CTC loss function.
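As a rough illustration of what "removing the timing" amounts to, here is a minimal sketch. It assumes the framewise targets are one integer class index per STFT frame and that the NULL class takes the extra, last output index; the function name and the example values are hypothetical and not taken from the actual data pipeline.

```python
from itertools import groupby


def frames_to_label_sequence(frame_labels):
    """Collapse each run of identical per-frame labels into a single label.

    Caveat: two genuinely distinct but identical adjacent segments would be
    merged into one, because the boundary between them is timing information.
    """
    return [label for label, _run in groupby(frame_labels)]


# Hypothetical framewise targets: one class index per STFT frame.
frame_labels = [0, 0, 0, 5, 5, 7, 7, 7, 7, 0, 0]

labels = frames_to_label_sequence(frame_labels)
label_length = len(labels)        # length of the untimed target sequence
input_length = len(frame_labels)  # number of frames the network sees

num_segments = 8                  # stands in for len(dataset.SEGMENTS)
null_index = num_segments         # assumed: NULL is the extra, last class

print(labels, label_length, input_length, null_index)
```

With targets like these, the CTC loss can only align a sample if its input length is at least as large as its label length, so both lengths are worth checking per sample.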
This CTC model does not seem to learn. Overall, the loss is not going down (it went down from 2000 to 180 in a dozen epochs of 80 sentences each, but then it went back up to 430), and the maximum likelihood output it produces creeps around [nh for all of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn]; the [] are part of the sequence, representing the pause at start and end of the audio.
I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful how I train the model, starting maybe with audio snippets with one, two or maybe three sounds each before giving the model complete sentences? In short,
How do I get this CTC model to learn?
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)
    connector = keras.layers.Concatenate(axis=-1)([lstmf, lstmb])
output = Dense(
    units=len(dataset.SEGMENTS)+1,
    activation=softmax)(connector)
loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])
ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])
ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))
ctc_lambda_func and the code to generate sequences from the predictions are from the OCR example.
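For readers unfamiliar with that decoding step, the "maximum likelihood output" mentioned above is best-path decoding: take the most probable class per frame, merge consecutive repeats, then drop the NULL class. The following is a plain Python sketch of that idea, not the code from the OCR example itself; the function name and the toy probabilities are made up, and NULL is assumed to be the last class index.

```python
from itertools import groupby


def greedy_ctc_decode(frame_probs, null_index):
    """Best-path CTC decoding for a single utterance.

    frame_probs: one list of class probabilities per frame.
    null_index:  index of the NULL (blank) class.
    """
    # 1. Most probable class in every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. Merge consecutive repeats of the same class.
    collapsed = [k for k, _run in groupby(best_path)]
    # 3. Remove the NULL class.
    return [k for k in collapsed if k != null_index]


# Toy softmax output: 6 frames, 2 real classes plus NULL at index 2.
frame_probs = [
    [0.1, 0.1, 0.8],  # NULL
    [0.7, 0.2, 0.1],  # class 0
    [0.6, 0.3, 0.1],  # class 0 (repeat, merged with the previous frame)
    [0.1, 0.1, 0.8],  # NULL separates the two occurrences of class 0
    [0.8, 0.1, 0.1],  # class 0 again
    [0.2, 0.7, 0.1],  # class 1
]

decoded = greedy_ctc_decode(frame_probs, null_index=2)
print(decoded)
```

The order of steps matters: repeats are merged before NULL is removed, which is what allows a NULL frame between two identical labels to keep them as separate outputs.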