I am new to TensorFlow and to deep learning. I am trying to recognize text in natural scene images. I used to work with an OCR engine, but I would like to use deep learning instead. The text always has the same format: ABC-DEF 88:88.

What I have done so far is recognize each character/digit individually. I cropped the image around every character (so each picture gives me 10 characters) to build my training and test sets, and then built a neural network with two convolutional layers. So my training set was a set of character pictures, and the labels were just the characters/digits.

But I want to go further. I would like to feed in the full picture and get the entire text as output (not one character at a time, as in my previous model).

Thank you in advance for any help.

A. Attia

1 Answer

The difficulty is that you don't know where the text is. Given an image, you can slide a window across it to crop different parts, then use a classifier to decide whether each cropped area contains text. If it does, run your character/digit recognizer on that area to tell which characters/digits they are.
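A minimal sketch of that sliding-window loop, in plain Python so it does not depend on any framework. The names `sliding_windows`, `crop`, `detect_text`, and the `contains_text` callable are hypothetical, introduced only for illustration; `contains_text` stands in for whatever text/no-text classifier you train, and is assumed to return a score between 0 and 1.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)
Image = List[List[float]]        # grayscale image as rows of pixel values


def sliding_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                    stride: int) -> Iterator[Box]:
    """Yield every window position that fits fully inside the image."""
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield (left, top, win_w, win_h)


def crop(image: Image, box: Box) -> Image:
    """Cut the given box out of the image."""
    left, top, w, h = box
    return [row[left:left + w] for row in image[top:top + h]]


def detect_text(image: Image, win_w: int, win_h: int, stride: int,
                contains_text: Callable[[Image], float],
                threshold: float = 0.5) -> List[Tuple[Box, float]]:
    """Score every window and return the hits, best score first.

    `contains_text` is an assumed stand-in for the trained
    text/no-text classifier; it is not part of any library.
    """
    img_h, img_w = len(image), len(image[0])
    hits = []
    for box in sliding_windows(img_w, img_h, win_w, win_h, stride):
        score = contains_text(crop(image, box))
        if score >= threshold:
            hits.append((box, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The best-scoring boxes would then be passed to the character/digit recognizer. A smaller `stride` finds the text more precisely but means more classifier calls per image.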

So you need to train another classifier: given a cropped image (the crop should be slightly larger than your text area), decide whether there is text inside.

Just construct a training set (positive samples are text areas, negative samples are other areas randomly cropped from the full images) and train it.
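One way to pick the crop boxes for such a training set could look like the sketch below. It assumes you have one annotated text box per image as `(left, top, width, height)`; the helper names `pad_box`, `overlap_ratio`, and `sample_negatives`, the 8-pixel margin, and the 10% overlap limit are all illustrative choices, not something prescribed above.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


def pad_box(box: Box, margin: int, img_w: int, img_h: int) -> Box:
    """Grow a text box by `margin` pixels per side, clipped to the image."""
    left, top, w, h = box
    new_left = max(0, left - margin)
    new_top = max(0, top - margin)
    new_right = min(img_w, left + w + margin)
    new_bottom = min(img_h, top + h + margin)
    return (new_left, new_top, new_right - new_left, new_bottom - new_top)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of box `a`."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (inter_w * inter_h) / float(aw * ah)


def sample_negatives(img_w: int, img_h: int, text_box: Box, n: int,
                     max_overlap: float = 0.1,
                     rng: Optional[random.Random] = None,
                     max_tries: int = 1000) -> List[Box]:
    """Randomly pick boxes of the same size that barely touch the text."""
    rng = rng or random.Random()
    _, _, w, h = text_box
    negatives: List[Box] = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        candidate = (rng.randint(0, img_w - w), rng.randint(0, img_h - h), w, h)
        if overlap_ratio(candidate, text_box) <= max_overlap:
            negatives.append(candidate)
    return negatives
```

The padded text box gives the positive crop, and the boxes returned by `sample_negatives` give the negative crops from the same image.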

soloice
  • Thanks, but does this sliding-window classifier have to be a convnet? And should the training set contain multi-character text areas or just single characters? – A. Attia Feb 15 '17 at 15:30
  • A convnet is fine and easy to implement if you are using TensorFlow, Caffe, or another deep learning framework, but it may be slow at detection time, because you have to slide the window across the whole image and every image yields many windows. Other models also work, such as boosting with Haar-like features (search for "haar like feature adaboost cascade" and you will find plenty of material on face detection). – soloice Feb 15 '17 at 15:40
  • @alexattia The training set should preferably contain multi-character areas. That lets you use a larger window and reduces false positives. If the area is too small, other things may be reported as letters or digits; for example, the algorithm may take a vertical edge for the digit "1", which is terrible. – soloice Feb 15 '17 at 15:43
  • OK, I'll try it! What do you think of https://matthewearl.github.io/2016/05/06/cnn-anpr/ ? It uses a single convnet instead of the two stages you described (detection + classification). – A. Attia Feb 15 '17 at 17:18
  • The project you mentioned is great and highly relevant! Try to reuse it instead of building a new one from scratch! – soloice Feb 16 '17 at 02:19