I have tried to read text from image of receipt using pytesseract. But a result text have a lot weird characters and it really looks awful. There is my code which i used to manipulate image:
import sys
from PIL import Image
import cv2 as cv
import numpy as np
import pytesseract
def manipulate_image(img):
img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
kernel = np.ones((1,1), dtype = "uint8")
img = cv.erode(img, kernel, iterations = 1)
img = cv.threshold(img, 0, 255,
cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
img = cv.medianBlur(img, 3)
return img
if len(sys.argv) > 2:
print("Please provide only name of image.")
elif len(sys.argv) == 2:
img = cv.imread(sys.argv[1])
img = manipulate_image(img)
cv.imwrite("test.png", img)
text = pytesseract.image_to_string(img)
print text.encode('utf8')
else:
print("Please provide name of image.")
there is my test receipt image: https://i.stack.imgur.com/mKrls.jpg and there is output image after manupulate: https://i.stack.imgur.com/ep5sH.jpg and there is text result:
""'9vco4v‘l7
0 .Vt3t00N 00t300N BUNUUS
SKLEP PUU POPUGOH|
UL. JHGIELLUNSKA 25, 70-364 SZCZ[C|N
TEL. 91 4841-20-58
N|P: 955—150-21-B2
dn.19r03.05 Uydr.8534
PARAGON FISKALNY
CIHSTKH 17 0,3 ¥ 16,30 = 4.89 B
Sp.0p.B 4,89 PTU B= 8,00% 0,35
Razem PTU 0,35
ZOP{HCUNU GUTUNKQ PLN
RESZTA PLN
0025/1373 H0103 0N|0 H.
15F H9HF[B9416} 13fl02D6k0[20D4334C
7?? BW 140
Any idea how to perform it in better way to get nicer results?