I've been trying to train the Tesseract engine to OCR images of numbers displayed in a seven-segment digital font.
After some searching, it turned out that Tesseract won't recognize a segmented font unless the segments are connected.
So I applied erosion (an OpenCV morphological operation) to the images to connect the segments. http://www.tutorialspoint.com/java_dip/eroding_dilating.htm
After that, I thresholded the image to convert it to binary before handing it to Tesseract (this step is arguably redundant, because Tesseract binarizes images internally). http://docs.opencv.org/2.4/doc/tutorials/imgproc/threshold/threshold.html
My main problem is that the numbers are written in black on a dark green background, so there is little contrast between the digits and the background. Here are the results:
Original image:
Method 1:
After erosion and binarization (I tried various threshold and max values):
Method 2: I tried the k-means and c-means clustering algorithms, but the results were not much better.
Method 3:
I also tried adaptive Gaussian thresholding
Method 4:
Handing the original image to Tesseract without any preprocessing and dumping the processed image it produces (Tesseract uses Leptonica for its internal image processing).
I also tried several other samples besides this one, and I tried enhancing the images in GIMP following the steps in "Gimp image processing", but nothing is working for me. Any suggestions? Thanks!