I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine.
I tried to extract text for Korean and Russian languages, and I am positive that I extracted.
And now I need to compare with the string and string got extracted from the image.
I can't compare the strings and to get the correct result, it just says not match.
Here is my code :
# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract
import argparse
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True, help="path to the image")
args = vars(ap.parse_args())
img = Image.open(args["input"])
img.load()
text = pytesseract.image_to_string(img)
print(text)
text = text.encode('ascii')
print(text)
i = 'Сред. Скорость'
print i
if ( text == i):
print "Match"
else :
print "Not Match"
The image used to extract text is attached.
Now I need a way to match it. And also I need to know the string extracted from pytesseract will be in Unicode or what? and if there is way to convert it into Unicode (like we have option in wordpad for converting character into Unicode)