Pytesseract reading receipt

Question

I tried to read the text from an image of a receipt using pytesseract, but the resulting text contains a lot of weird characters and looks awful. Here is the code I used to process the image:

import sys
from PIL import Image
import cv2 as cv
import numpy as np
import pytesseract


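def manipulate_image(img):
    # Convert to grayscale before thresholding
    img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    # Note: eroding with a 1x1 kernel leaves the image unchanged
    kernel = np.ones((1, 1), dtype="uint8")
    img = cv.erode(img, kernel, iterations=1)
    # Binarize with Otsu's automatically chosen global threshold
    img = cv.threshold(img, 0, 255, cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
    # Remove salt-and-pepper noise
    img = cv.medianBlur(img, 3)
    return img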


if len(sys.argv) > 2:
    print("Please provide only name of image.")
elif len(sys.argv) == 2:
    img = cv.imread(sys.argv[1])

    img = manipulate_image(img)
    cv.imwrite("test.png", img)
    text = pytesseract.image_to_string(img)
    print(text)
else:
    print("Please provide name of image.")

Here is my test receipt image: https://i.stack.imgur.com/mKrls.jpg, here is the output image after processing: https://i.stack.imgur.com/ep5sH.jpg, and here is the resulting text:

""'9vco4v‘l7

0 .Vt3t00N 00t300N BUNUUS



SKLEP PUU POPUGOH|
UL. JHGIELLUNSKA 25, 70-364 SZCZ[C|N
TEL. 91 4841-20-58
N|P: 955—150-21-B2
dn.19r03.05 Uydr.8534
PARAGON FISKALNY
CIHSTKH 17 0,3 ¥ 16,30 = 4.89 B
Sp.0p.B 4,89 PTU B= 8,00% 0,35
Razem PTU 0,35
ZOP{HCUNU GUTUNKQ PLN
RESZTA PLN
0025/1373 H0103 0N|0 H.
15F H9HF[B9416} 13ﬂ02D6k0[20D4334C
7?? BW 140

Any idea how to do this in a better way to get cleaner results?

You can try some image processing with OpenCV, such as eroding or dilating, or run `textcleaner` from ImageMagick on the images, and then try Tesseract. — Arkistarvh Kltzuonstev, Mar 13 '19 at 10:54
I have tried eroding, but it does not always work. I will look at textcleaner. Thanks! — A. Blicharski, Mar 13 '19 at 10:58

score 0 · Accepted Answer · answered Mar 13 '19 at 11:01

Applying simple thresholding will not be enough for pyTesseract to properly detect the characters. There is much more preprocessing that can be done to drastically improve your results, such as:

using Tesseract V4, where deep learning is implemented
segmenting characters
using only the part of the receipt where the text is through edge detection
perspective transform to straighten out the text

These are somewhat lengthy topics to write all in one answer, but you can check out some articles on pyImageSearch, where this is talked about in much more depth:

https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/
https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/

Thank you for those links. I was struggling to decide which transformations to apply, and these links could be really helpful. — A. Blicharski, Mar 13 '19 at 11:10
