I am trying to build a classifier with TensorFlow that recognizes a specific piece of text in an image. Inspired by the SVHN project, I want to read the clock in soccer pictures. Throughout the project I am focused only on the clock. I have added a picture to make the project clearer.
The first thing I did was crop the pictures around the clock and predict the digits (there can be 3 or 4 digits, for instance 9:38 or 11:34). It worked well: I get good accuracy (>90% on the test set) with 20k pictures in my training set.
Now I would like to do something more complicated. I think a neural net should be able to do it, but I am not sure. Instead of cropping exactly around the clock, I crop the whole scoreboard (with team names, etc.) and I still want to predict the clock.
I tried training with 20k pictures and with 40k pictures. In both cases I only get 70% accuracy on the test set. The clock is almost always at the same position in the pictures (at the top of the scoreboard).
I don't understand why the accuracy is so low. If someone has a clue, it would be really helpful. Thank you very much for any help.
Specifications:
Image size: 32x32
Number of labels: 11 (0-9 + blank)
Model:
7-layer CNN.
C1: convolutional layer, batch_size x 28 x 28 x 16, convolution size: 5 x 5 x 1 x 16
S2: sub-sampling layer, batch_size x 14 x 14 x 16
C3: convolutional layer, batch_size x 10 x 10 x 32, convolution size: 5 x 5 x 16 x 32
S4: sub-sampling layer, batch_size x 5 x 5 x 32
C5: convolutional layer, batch_size x 1 x 1 x 64, convolution size: 5 x 5 x 32 x 64
Dropout
F6: fully-connected layer, weight size: 64 x 16
Output layer, weight size: 16 x 11
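The layer sizes listed above can be checked with a little arithmetic. This is only a sketch: it assumes the convolutions use no padding ("valid") with stride 1 and that each sub-sampling layer is a 2x2 window with stride 2, which is what the listed shapes suggest. The function names are made up for illustration.

```python
def conv_valid(size, kernel, stride=1):
    """Output width of a convolution without padding."""
    return (size - kernel) // stride + 1


def pool(size, window=2, stride=2):
    """Output width of a sub-sampling (pooling) layer."""
    return (size - window) // stride + 1


def trace_shapes(input_size=32):
    """Return (name, height, width, channels) for each layer from C1 to C5."""
    layers = [
        ("C1", "conv", 16),
        ("S2", "pool", 16),
        ("C3", "conv", 32),
        ("S4", "pool", 32),
        ("C5", "conv", 64),
    ]
    size = input_size
    shapes = []
    for name, kind, channels in layers:
        # All convolutions in the list use a 5x5 kernel.
        size = conv_valid(size, 5) if kind == "conv" else pool(size)
        shapes.append((name, size, size, channels))
    return shapes


for row in trace_shapes():
    print(row)
```

Under these assumptions a 32x32 input ends up as a 1 x 1 x 64 feature vector after C5, which matches the 64 x 16 weight size of F6. The same arithmetic holds whether the 32x32 input is a tight crop of the clock or the whole scoreboard resized to 32x32.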