
I am trying to program a Keras model for audio transcription using connectionist temporal classification (CTC). Starting from a mostly working framewise classification model and the OCR example, I came up with the model given below, which I want to train to map the short-time Fourier transform of German sentences to their phonetic transcription.

My training data actually do have timing information, so I can use them to train a framewise model without CTC. The framewise prediction model, without the CTC loss, works decently (training accuracy 80%, validation accuracy 50%). There is, however, much more potential training data available without timing information, so I really want to switch to CTC. To test this, I removed the timing from the data, increased the output size by one for the NULL (blank) class, and added a CTC loss function.
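To be concrete, this is roughly the kind of target preparation I mean; it is a simplified sketch, where `transcripts`, `max_label_length` and `spectrograms` stand in for my actual data handling and each transcript is a sequence of integer indices into `dataset.SEGMENTS`:

import numpy

# Index len(dataset.SEGMENTS) is the extra NULL/blank class mentioned above.
blank = len(dataset.SEGMENTS)

# Pad all label sequences to a common length; label_length records the
# true (unpadded) length of each transcript.
labels = numpy.full((len(transcripts), max_label_length), blank)
label_length = numpy.zeros((len(transcripts), 1))
for i, transcript in enumerate(transcripts):
    labels[i, :len(transcript)] = transcript
    label_length[i, 0] = len(transcript)

# Number of STFT frames per utterance, i.e. the time steps the model outputs.
input_length = numpy.array([[len(sg)] for sg in spectrograms])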

This CTC model does not seem to learn. Overall, the loss is not going down (it went down from 2000 to 180 in a dozen epochs of 80 sentences each, but then it went back up to 430), and the maximum likelihood output it produces creeps around [nh] for all of the sentences, which generally have around six words and transcriptions like [foːɐmʔɛsndʰaɪnəhɛndəvaʃn]. [] is part of the sequence, representing the pause at the start and end of the audio.

I find it somewhat difficult to find good explanations of CTC in Keras, so it may be that I did something stupid. Did I mess up the model, mixing up the order of arguments somewhere? Do I need to be much more careful about how I train the model, maybe starting with audio snippets of one, two or three sounds each before giving the model complete sentences? In short,

How do I get this CTC model to learn?

from keras.layers import Bidirectional, Concatenate, Dense, Lambda, LSTM
from keras.models import Model
from keras.optimizers import SGD

# Stack of bidirectional LSTM layers; forward and backward outputs are
# returned separately (merge_mode=None) and concatenated by hand.
connector = inputs
for l in [100, 100, 150]:
    lstmf, lstmb = Bidirectional(
        LSTM(
            units=l,
            dropout=0.1,
            return_sequences=True,
        ), merge_mode=None)(connector)

    connector = Concatenate(axis=-1)([lstmf, lstmb])

# One softmax output per segment class, plus one for the CTC blank (NULL).
output = Dense(
    units=len(dataset.SEGMENTS)+1,
    activation='softmax')(connector)

# The CTC loss is computed inside the graph, so this Lambda layer's
# output is the per-sample loss itself.
loss_out = Lambda(
    ctc_lambda_func, output_shape=(1,),
    name='ctc')([output, labels, input_length, label_length])

ctc_model = Model(
    inputs=[inputs, labels, input_length, label_length],
    outputs=[loss_out])
# The compiled loss just passes the Lambda layer's output through.
ctc_model.compile(loss={'ctc': lambda y_true, y_pred: y_pred},
                  optimizer=SGD(
                      lr=0.02,
                      decay=1e-6,
                      momentum=0.9,
                      nesterov=True,
                      clipnorm=5))
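Training then uses dummy targets, since the Lambda layer already computes the loss inside the graph. A sketch of the training call, following the same pattern as the OCR example (batch size and epoch count here are placeholders):

import numpy

# y_true is ignored by the pass-through loss above, so any array of the
# right length works as the target.
dummy_targets = numpy.zeros((len(spectrograms), 1))
ctc_model.fit(
    x=[spectrograms, labels, input_length, label_length],
    y=dummy_targets,
    batch_size=16,
    epochs=10)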

ctc_lambda_func and the code to generate sequences from the predictions are taken from the OCR example.
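For reference, the loss function from that example looks like this (nearly verbatim; note that the cropping of the first two time steps is specific to the OCR example's front-end and may not be wanted for audio):

from keras import backend as K

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # The OCR example crops the first two outputs because the first
    # couple of RNN outputs tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

Generating sequences from the predictions can use Keras' built-in best-path decoder; in this sketch, softmax_outputs and frame_counts are placeholders for the prediction model's output and the per-utterance frame counts:

# Greedy (best-path) CTC decoding: collapses repeats, drops the blank.
decoded, log_probs = K.ctc_decode(softmax_outputs, frame_counts, greedy=True)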

– Anaphory
  • How do you manage to learn without CTC? Normally, I use CTC when I cannot use anything else (outputs are not aligned). Maybe the data you are using is already preprocessed for not using CTC? – Daniel GL Dec 06 '18 at 09:03
  • Does my changed explanation work? – Anaphory Dec 06 '18 at 11:39
  • Yes, thank you. I just wanted to be sure that the inputs were adapted to the question. I do not see any problem in your code with the CTC (I also created my model based on the same example). I train with full sentences (OCR-like images) and I have no problem with the training. – Daniel GL Dec 06 '18 at 12:55

1 Answer


It is entirely invisible from the code given here, but elsewhere OP links to their GitHub repository. The error actually lies in the data preparation:

The data are log spectrograms. They are unnormalized and mostly strongly negative. The CTC loss picks up on the overall distribution of labels much faster than the LSTM layers can adapt their input weights and biases, so all variation in the input gets flattened out. The temporary local minimum of the loss then presumably comes from epochs in which the marginalized distribution of labels has not yet been adopted globally.

The solution to this is to scale the input spectrograms such that they contain both positive and negative values:

import numpy

for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    # Min-max scale each spectrogram to the range [-1, 1]
    spectrograms[i][:len(sg)] = 2 * (sg - sg.min()) / (sg.max() - sg.min()) - 1
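A per-utterance standardization to zero mean and unit variance should serve the same purpose as the min-max scaling above; a minimal alternative sketch, assuming the same files/spectrograms setup:

import numpy

for i, file in enumerate(files):
    sg = numpy.load(file.with_suffix(".npy").open("rb"))
    # Zero mean, unit variance per spectrogram instead of min-max scaling
    spectrograms[i][:len(sg)] = (sg - sg.mean()) / sg.std()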
– Anaphory