PIL preprocessing for tesseract ocr

Question

How do I increase the accuracy of OCR?

I am using pyocr to call the tesseract binary, wand to convert the PDF to an image, and Pillow to process the image for OCR.

I have attached all the images.

I feel this is the best preprocessing that can be done.

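from PIL import ImageFilter
from PIL.ImageDraw import Draw
import pyocr
import pyocr.builders

imgf = img.convert('RGB')  # convert so a line can be drawn down the middle
draw = Draw(imgf)
x, y = imgf.size
eX, eY = 20, 800  # width and height of the thin vertical ellipse
# bounding box of the ellipse, centered on the image
box = (x/2 - eX/2, y/2 - eY/2, x/2 + eX/2, y/2 + eY/2)
draw.ellipse(box, fill=0)  # fill=0 draws it in black
del draw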


im2 = imgf.filter(ImageFilter.MinFilter(1))  #filter
im2 = im2.filter(ImageFilter.SMOOTH_MORE)
im2 = im2.filter(ImageFilter.SMOOTH_MORE)

for img in req_image:   # OCR
    txt = tool.image_to_string(
        im2,
        lang=lang,
        builder=pyocr.builders.DigitBuilder()
    )
print(txt)

The image is first cropped out of the PDF, then converted to grayscale, and then processed with the code above.
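The grayscale and filter steps described above could be wrapped in one function so that every cropped image gets the same treatment. This is only a minimal sketch: `preprocess` and `min_filter_size` are hypothetical names, and the default window size of 3 is an assumption (the code above uses 1, a single-pixel window).

```python
from PIL import Image, ImageFilter


def preprocess(img, min_filter_size=3):
    """Grayscale, min-filter, and smooth an image before OCR."""
    gray = img.convert("L")  # grayscale, as described in the question
    # MinFilter replaces each pixel with the darkest value in its window,
    # which thickens dark strokes on a light background.
    filtered = gray.filter(ImageFilter.MinFilter(min_filter_size))
    # Two SMOOTH_MORE passes, matching the question's code.
    for _ in range(2):
        filtered = filtered.filter(ImageFilter.SMOOTH_MORE)
    return filtered


if __name__ == "__main__":
    # Stand-in for a page cropped from a PDF (hypothetical blank image).
    page = Image.new("RGB", (200, 100), "white")
    result = preprocess(page)
    print(result.mode, result.size)
```

Calling `preprocess(img)` inside the OCR loop and passing its result to `tool.image_to_string` would make sure each image in `req_image` is actually the one being recognized.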

I added the line in the middle and found that it greatly increased the accuracy (I had a feeling it would work).
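For reference, the "line" is a thin, tall ellipse whose bounding box is centered on the image. The box arithmetic can be sketched on its own as below; `centered_box` is a hypothetical helper name, and the 100 x 1000 image size is made up for illustration.

```python
def centered_box(width, height, box_w, box_h):
    """Return (left, top, right, bottom) for a box centered in the image."""
    cx, cy = width / 2, height / 2
    return (cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2)


# A 20 x 800 box in a hypothetical 100 x 1000 image.
print(centered_box(100, 1000, 20, 800))  # (40.0, 100.0, 60.0, 900.0)
```

The returned tuple can be passed directly to `draw.ellipse` (or `draw.rectangle` for a straight-edged line).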

Image 1: Accurate
Image 2: Accurate
Image 3: Inaccurate (returns 6563 8 1)
Image 4: Greyscale image from the PDF

score 1 · Answer 1 · answered Oct 23 '17 at 00:45

Microsoft has released a great API called Cognitive Services. You can use it for image recognition.

https://azure.microsoft.com/en-us/services/cognitive-services/

Yuze Ma

Yes, but I think training would be a good idea, since I have thousands of PDFs to OCR, all containing only digits in the same font. – Darshan Jadav Oct 23 '17 at 07:42
