Deep Learning solution for digit recognition on natural scene

Question

I am working on a problem, where I want to automatically read the number on images as follows:

As can be seen, the images are quite challenging! The digits are not always made of connected lines, and the contrast varies a lot. My first attempt used pytesseract after some preprocessing. I also created a StackOverflow post here.

While this approach works fine on an individual image, it does not generalize, as the preprocessing requires too much manual tuning per image. The best solution I have so far is to iterate over some hyperparameters such as the threshold value, the filter size for erosion/dilation, etc. However, this is computationally expensive!
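A rough sketch of such a brute-force search is shown here. It is only an illustration: `read_fn` is a hypothetical callable standing in for the thresholding, erosion/dilation and pytesseract steps, and picking the result by majority vote is an assumption, not something fixed by the approach itself.

```python
from itertools import product


def grid_search(image, read_fn, thresholds, kernel_sizes, iterations):
    """Try every parameter combination and return the most frequent digit string.

    read_fn(image, threshold, kernel_size, n_iter) is a hypothetical callable
    that preprocesses the image and returns the OCR text ("" on failure).
    """
    votes = {}
    n_calls = 0
    for threshold, kernel_size, n_iter in product(thresholds, kernel_sizes, iterations):
        n_calls += 1
        text = read_fn(image, threshold, kernel_size, n_iter)
        if text.isdigit():  # ignore readings that are empty or contain non-digits
            votes[text] = votes.get(text, 0) + 1
    best = max(votes, key=votes.get) if votes else None
    return best, n_calls
```

The number of OCR calls is the product of the sizes of all parameter ranges, which is why this gets expensive quickly.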

Therefore, I came to believe that the solution I am looking for must be deep-learning-based. I have two ideas here:

Using a pre-trained network on a similar task
Splitting the input images into separate digits and training / fine-tuning a network myself in an MNIST fashion

Regarding the first approach, I have not found anything good yet. Does anyone have an idea for that?

Regarding the second approach, I would need a method first to automatically generate images of the separate digits. I guess this should also be deep-learning-based. Afterward, I could maybe achieve some good results with some data augmentation.

Does anyone have ideas? :)

igrinis · Answer 1 · 2021-02-22T22:53:35.253

Your task is really challenging. I have several ideas; maybe they will help you on the way. First, if you get the images right, you can use EasyOCR. It detects text in the image with an algorithm called CRAFT and then recognizes it using a CRNN. It provides very fine-grained control over the detection and recognition parts. For example, after some manual manipulation of the images (greyscaling, contrast enhancement and sharpening) I got

and using the following code

import easyocr
reader = easyocr.Reader(['en']) # need to run only once to load model into memory
reader.readtext(path_to_file, allowlist='0123456789')

the results are 31197432 and 31197396.
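Note that `readtext` returns a list of (bounding box, text, confidence) entries rather than a single string, so a number detected as several fragments has to be joined. A minimal sketch, assuming the digits sit on a single line; the helper names `digits_left_to_right` and `read_number` are made up for illustration:

```python
def digits_left_to_right(results, min_confidence=0.0):
    """Concatenate recognized fragments in left-to-right order.

    `results` is expected in the format returned by reader.readtext():
    a list of (bounding_box, text, confidence), where bounding_box is a
    list of four [x, y] corner points.
    """
    kept = [r for r in results if r[2] >= min_confidence]
    # sort fragments by the leftmost x coordinate of their bounding box
    kept.sort(key=lambda r: min(point[0] for point in r[0]))
    return "".join(text for _, text, _ in kept)


def read_number(path_to_file, reader=None):
    """Run EasyOCR restricted to digits and return one digit string."""
    if reader is None:
        import easyocr  # imported lazily because loading the model is slow
        reader = easyocr.Reader(['en'])
    results = reader.readtext(path_to_file, allowlist='0123456789')
    return digits_left_to_right(results)
```

Raising `min_confidence` drops uncertain fragments, at the risk of losing digits.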

Now, for the contrast restoration part, OpenCV has a tool called CLAHE. If you run the following code

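import cv2

img = cv2.imread(fileName)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# estimate the local background with a heavy blur
blurred = cv2.GaussianBlur(gray, (25, 25), 0)
# keep only pixels brighter than 1% of their local background; the rest become 0
grayscaleImage = gray * ((gray / blurred) > 0.01)
# contrast-limited adaptive histogram equalization
clahe = cv2.createCLAHE(clipLimit=6.0, tileGridSize=(16,6))
contrasted = clahe.apply(grayscaleImage)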

on the original images, you will get images that are visually very similar to those above. I believe that after some cleaning you can make them recognizable without too much fiddling with hyperparameters.

And finally, if you want to train your own deep learning OCR, I suggest you use keras-ocr. It uses the same algorithms as EasyOCR, but provides an end-to-end training pipeline for building a new OCR model. It covers all the necessary steps: dataset downloading, data generation, augmentation, training and inference.
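For the data generation step, a digit-only model needs digit-only training text. A minimal sketch of a text source that could feed whatever image synthesis step you use; the function name and length range are assumptions, and you should check the keras-ocr documentation for how its data generation expects text to be supplied:

```python
import random


def digit_text_generator(min_length=6, max_length=10, seed=None):
    """Yield random digit strings to be rendered into synthetic training images."""
    rng = random.Random(seed)  # private RNG so a seed gives reproducible data
    while True:
        length = rng.randint(min_length, max_length)
        yield "".join(rng.choice("0123456789") for _ in range(length))
```

With `min_length=8, max_length=8` every sample has the same length as the numbers in your images.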

Take into account that deep learning solutions are very computationally heavy. Good luck!

Thanks a lot for the answer, I will definitely try these ideas! :) — spadel, Feb 23 '21 at 22:47

score 2 · Accepted Answer · answered Feb 26 '21 at 02:12

Regarding your first approach:

There are two synthetically prepared datasets available:

Text Recognition Data consists of 9M images.
SynthText in the Wild consists of 8M images.

I have used the above datasets for text recognition on slab images. The images were quite challenging, but I now achieve more than 90% accuracy on them. I used the following models to solve this task:

CRAFT for text localization.
Deep Text Recognition for text recognition.

If you are working with these kinds of images only, I highly encourage you to try Deep Text Recognition. It is a 4-stage framework:

For the Transformation stage, you can choose TPS or None. TPS has shown higher performance; it is implemented as a Spatial Transformer Network.
For the Feature Extraction stage, you have two options: ResNet or VGG.
For the Sequence Modeling stage: BiLSTM.
For the Prediction stage: Attn or CTC.
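To give an idea of what the CTC option in the prediction stage does, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. It is not the repository's code; the blank class at index 0 and the digit-only alphabet are assumptions made for illustration.

```python
BLANK = 0  # assumption: class 0 is the CTC blank, class i maps to alphabet[i - 1]


def ctc_greedy_decode(frame_scores, alphabet="0123456789"):
    """Pick the best class per time step, collapse repeats, drop blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = BLANK
    for index in best_path:
        # emit a character only when it is not blank and not a repeat
        if index != BLANK and index != previous:
            decoded.append(alphabet[index - 1])
        previous = index
    return "".join(decoded)
```

A blank between two identical classes is what allows repeated digits such as "22" to survive the collapse step.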

They achieved the best accuracy with the TPS-ResNet-BiLSTM-Attn version. You can easily fine-tune this network, and I hope it can solve your task. The model was trained on the above-mentioned datasets.
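A sketch of how a fine-tuning run could be assembled. The flag names are as I recall them from the repository's train.py, so please verify them against the version you clone; the data and weight paths are hypothetical, and the helper `build_train_args` is made up for illustration. Only the stage options named in this answer are accepted.

```python
STAGE_OPTIONS = {
    "Transformation": ("None", "TPS"),
    "FeatureExtraction": ("VGG", "ResNet"),
    "SequenceModeling": ("BiLSTM",),
    "Prediction": ("CTC", "Attn"),
}


def build_train_args(train_data, valid_data, saved_model=None,
                     transformation="TPS", feature_extraction="ResNet",
                     sequence_modeling="BiLSTM", prediction="Attn"):
    """Return the argument list for a training run of the 4-stage model."""
    chosen = {
        "Transformation": transformation,
        "FeatureExtraction": feature_extraction,
        "SequenceModeling": sequence_modeling,
        "Prediction": prediction,
    }
    for stage, value in chosen.items():
        if value not in STAGE_OPTIONS[stage]:
            raise ValueError(f"{value!r} is not a listed option for {stage}")
    args = ["python3", "train.py", "--train_data", train_data, "--valid_data", valid_data]
    for stage, value in chosen.items():
        args += [f"--{stage}", value]
    if saved_model:
        # start from pretrained weights and fine-tune instead of training from scratch
        args += ["--saved_model", saved_model, "--FT"]
    return args
```

The resulting list can be passed to `subprocess.run(args)` from inside the cloned repository.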
