
I am training a handwriting recognition model with this architecture:

{
  "network": [
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "conv2d",
      "num_filters": 16,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 32,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 64,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 128,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "conv2d",
      "num_filters": 256,
      "kernel_size": 5,
      "stride": 1,
      "padding": "same"
    },
    {
      "layer_type": "max_pool2d",
      "pool_size": 2,
      "stride": 2,
      "padding": "same"
    },
    {
      "layer_type": "l2_normalize"
    },
    {
      "layer_type": "dropout",
      "keep_prob": 0.5
    },
    {
      "layer_type": "collapse_to_rnn_dims"
    },
    {
      "layer_type": "birnn",
      "num_hidden": 128,
      "cell_type": "LSTM",
      "activation": "tanh"
    }
  ],
  "output_layer": "ctc_decoder"
}
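
The JSON spec is just a convenient way to build the graph; each entry carries pretty much the same arguments as the corresponding TensorFlow layer (see the comments below). As a hypothetical illustration of the kind of dispatch such a builder might do (the actual builder isn't shown here, so all names are assumptions):

import tensorflow as tf

def build_layer(net, spec):
    # Hypothetical mapping from one spec entry to a TF 1.x op; the real
    # builder is not shown in the question.
    layer_type = spec["layer_type"]
    if layer_type == "conv2d":
        return tf.layers.conv2d(net, filters=spec["num_filters"],
                                kernel_size=spec["kernel_size"],
                                strides=spec["stride"],
                                padding=spec["padding"])
    if layer_type == "max_pool2d":
        return tf.layers.max_pooling2d(net, pool_size=spec["pool_size"],
                                       strides=spec["stride"],
                                       padding=spec["padding"])
    if layer_type == "l2_normalize":
        return tf.nn.l2_normalize(net, axis=-1)
    if layer_type == "dropout":
        return tf.nn.dropout(net, keep_prob=spec["keep_prob"])
    raise ValueError("unknown layer_type: %s" % layer_type)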

The training CTC loss drops sharply during the first epoch, but then it plateaus and fluctuates for the rest of the epochs. The label error rate not only fluctuates but doesn't really seem to go any lower.

[plot of the training CTC loss and label error rate over the epochs]

I should mention that the sequence length of each sample ends up really close to the length of the longest ground truth: the input width of 1024 is reduced to 32 time steps by the time it enters ctc_loss, which is close to the longest ground-truth length of 21.
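
For clarity, that reduction comes from the five stride-2 max-pool layers, each of which halves the width; a quick sanity check:

width = 1024
for _ in range(5):  # five stride-2 max-pool layers
    width //= 2
print(width)  # 32 time steps entering ctc_loss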

As for the preprocessing of the images, I made sure that the aspect ratio is maintained when resizing, and I right-padded each image to make it square, so all the images have the same width and the handwritten word sits on the left. I also inverted the colors so that the handwritten characters have the highest pixel value (255) while the background has the lowest pixel value (0).

[sample preprocessed image]
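
A minimal sketch of that preprocessing, assuming grayscale input, OpenCV, and a square target size of 1024 (inferred from the sequence-length note above); the function and names are mine, not the actual pipeline:

import cv2
import numpy as np

def preprocess(image, target_size=1024):
    # Resize so the longer side fits target_size, keeping the aspect ratio.
    h, w = image.shape[:2]
    scale = float(target_size) / max(h, w)
    resized = cv2.resize(image, (int(w * scale), int(h * scale)))
    # Invert so the ink is 255 and the background is 0.
    inverted = 255 - resized
    # Pad with background (0) to a square canvas; the word stays on the left.
    canvas = np.zeros((target_size, target_size), dtype=np.uint8)
    canvas[:inverted.shape[0], :inverted.shape[1]] = inverted
    return canvas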

The predictions look something like this: a random-looking run of labels in the first part, then a bunch of zeroes at the end (which is probably expected because of the padding).

INFO:tensorflow:outputs = [[59 45 59 45 59 55 59 55 59 45 59 55 59 55 59 55 45 59  8 59 55 45 55  8
  45  8 45 59 45  8 59  8 45 59 45  8 45 19 55 45 55 45 55 59 45 59 45  8
  45  8 45 55  8 45  8 45 59 45 55 59 55 59  8 55 59  8 45  8 45  8 59  8
  59 45 59 45 59 45 59 45 59 45 59 45 19 45 55 45 22 45 55 45 55  8 45  8
  59 45 59 45 59 45 59 55  8 45 59 45 59 45 59 45 19 45 59 45 19 59 55 24
   4 52 54 55]]
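
These numbers are raw label indices; to read them as text they have to be mapped back through the character set used to encode the ground truths. A hypothetical helper (`charset`, the index-to-character list, is an assumption and not shown here):

def labels_to_text(labels, charset):
    # charset: list mapping label index -> character (assumed, not shown above)
    return "".join(charset[i] for i in labels)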

Here's how I collapse the CNN outputs to RNN dims:

import tensorflow as tf

def collapse_to_rnn_dims(inputs):
    # inputs: [batch, height, width, channels] feature map from the CNN.
    batch_size, height, width, num_channels = inputs.get_shape().as_list()
    if batch_size is None:
        batch_size = -1
    # Move width (the time axis) to the front: [width, batch, height, channels].
    time_major_inputs = tf.transpose(inputs, (2, 0, 1, 3))
    # Collapse each column of the feature map into one feature vector per step.
    reshaped_time_major_inputs = tf.reshape(time_major_inputs,
                                            [width, batch_size, height * num_channels]
                                            )
    # Back to batch-major: [batch, width, height * channels].
    batch_major_inputs = tf.transpose(reshaped_time_major_inputs, (1, 0, 2))
    return batch_major_inputs
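
With the architecture above and 1024x1024 inputs, the feature map entering this function is [batch, 32, 32, 256], so the output is [batch, 32, 8192]: the width becomes the time axis and each column of the feature map becomes one 32 * 256 = 8192-dimensional time step. A quick shape check (the batch size of 2 is arbitrary):

inputs = tf.placeholder(tf.float32, [2, 32, 32, 256])
rnn_inputs = collapse_to_rnn_dims(inputs)
print(rnn_inputs.get_shape().as_list())  # [2, 32, 8192] = [batch, time, features]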

And here's how I collapse the RNN outputs to CTC dims:

import tensorflow.contrib.slim as slim

def convert_to_ctc_dims(inputs, num_classes, num_steps, num_outputs):
    # Flatten to [batch * num_steps, num_outputs] for the dense layers.
    outputs = tf.reshape(inputs, [-1, num_outputs])
    # Note: slim.fully_connected applies ReLU by default, so both of these
    # layers are ReLU-activated.
    logits = slim.fully_connected(outputs, num_classes,
                                  weights_initializer=slim.xavier_initializer())
    logits = slim.fully_connected(logits, num_classes,
                                  weights_initializer=slim.xavier_initializer())
    # Reshape to time-major [num_steps, batch, num_classes] for CTC.
    logits = tf.reshape(logits, [num_steps, -1, num_classes])
    return logits
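
For reference, the time-major [num_steps, batch, num_classes] logits produced here are the shape tf.nn.ctc_loss expects with time_major=True (the default). A sketch of the wiring, where `sparse_labels` (a tf.SparseTensor of ground-truth label indices) and `seq_lengths` (an int32 vector of per-sample time steps, all 32 here) are assumptions:

loss = tf.reduce_mean(tf.nn.ctc_loss(labels=sparse_labels,
                                     inputs=logits,
                                     sequence_length=seq_lengths))
decoded, _ = tf.nn.ctc_beam_search_decoder(logits, seq_lengths)
# Label error rate = mean edit distance between prediction and ground truth.
label_error_rate = tf.reduce_mean(
    tf.edit_distance(tf.cast(decoded[0], tf.int32), sparse_labels))
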
  • could you show some samples of ground truth vs. recognized texts? Does the output first show random strings and then, after plateauing, always show the empty string? Do you normalize the input, i.e. map the gray-value distribution to mean=0 and std=1? Have you tried a larger learning rate for Adam (I use 0.1)? – Harry Mar 19 '18 at 17:25
  • I apologize for the very late reply. I'm working on another module of the project. I'll get back to this in a few days. – Rocket Pingu Apr 04 '18 at 02:07
  • Finally done with the other module I was working on. I already updated the problem. – Rocket Pingu Apr 30 '18 at 00:39
  • I noticed one more thing: if I keep feeding one and the same image every step, the loss goes down but the label error rate doesn't. – Rocket Pingu May 02 '18 at 09:30
  • Can you please map the output labels to characters, so that it is easier to see what the NN outputs? Maybe the components of the NN are connected the wrong way. I don't know the JSON-style architecture specification you use. However, it is important to connect the components the right way, e.g. in my HTR system the RNN outputs a tensor of shape BxTxC while the CTC needs TxBxC as input. So I have to transpose the tensor between RNN and CTC. – Harry May 02 '18 at 11:12
  • I just use the JSON format to build the architecture easily. It's pretty much the same arguments you pass to the TensorFlow layers (i.e. conv2d requires `kernel_size`, `stride`, `num_filters`, and whatnot). I'll upload a small reproducible example. – Rocket Pingu May 02 '18 at 11:20
  • I'm using a TF estimator btw. I don't really know how to map the outputs while the training is ongoing. – Rocket Pingu May 02 '18 at 11:33
  • I've tried to reproduce the example as best I can with this: https://github.com/selcouthlyBlue/CNN-LSTM-CTC-HIGH-LOSS However, this one runs into a `no valid path error` for some reason. – Rocket Pingu May 02 '18 at 11:46
  • Already fixed the `no valid path error`. It's the minimal code I have to reproduce the issue. – Rocket Pingu May 03 '18 at 01:27
  • I've made a clearer question here: https://stackoverflow.com/questions/50148945/not-converging-ctc-loss-and-fluctuating-label-error-rate-edit-distance-on-sing – Rocket Pingu May 03 '18 at 07:11
