Pytesseract foreign language extraction using python

Question

I am using Python 2.7, Pytesseract-0.1.7 and Tesseract-ocr 3.05.01 on a Windows machine.

I tried to extract Korean and Russian text, and I am positive that the extraction itself worked.

Now I need to compare a reference string with the string extracted from the image.

The comparison never gives the correct result; it always prints "Not Match".

Here is my code :

# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract
import argparse
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", required=True, help="path to the image")
args = vars(ap.parse_args())
img = Image.open(args["input"])
img.load()
text = pytesseract.image_to_string(img)
print(text)
text = text.encode('ascii')
print(text)
i = 'Сред. Скорость'
print i
if ( text == i):
    print "Match"
else :
    print "Not Match"

The image used to extract text is attached.

Now I need a way to match the two strings. I also need to know whether the string returned by pytesseract is Unicode or something else, and whether there is a way to convert it to Unicode (like the option in WordPad for converting characters to Unicode).

I don't know whether I did something wrong when extracting the Russian text. For example, do I need to specify which language I am extracting? – Deepan Raj, Jun 22 '17 at 06:35
There is no image. If you include `from __future__ import print_function` at the top of your file (below the `coding` line), you can use the `print` function consistently. As written, `print(text)` is a `print` statement followed by `text` in redundant parentheses. – Anthon, Jun 22 '17 at 06:37

Marjan Moderc · Accepted Answer · 2018-05-21T08:20:08.937

23

You are using Tesseract with a language other than English, so first of all make sure that the trained data (language pack) for your language is installed, as shown here (Linux instructions only).
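
One way to check this from Python is to ask the Tesseract command-line tool which languages it can see. This is only a rough sketch: it assumes the `tesseract` executable is on your PATH, and the helper names `parse_langs` and `installed_langs` are made up for illustration. The output of both streams is merged in case your version prints the list to stderr.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages (N):" header.
    return [line for line in lines if line and not line.lower().startswith("list of")]


def installed_langs(tesseract_cmd="tesseract"):
    """Run the Tesseract CLI and return the language codes it reports."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; the list may go to either one
        universal_newlines=True,
    )
    return parse_langs(result.stdout)
```

If `"rus" in installed_langs()` is false, passing `lang="rus"` to pytesseract cannot work until the Russian trained data is installed.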

Secondly, I strongly suggest switching to Python 3 if you are working with non-ASCII languages (as I do, being Slovenian). Python 3 works with Unicode out of the box, so it saves you a lot of pain with encoding and decoding strings.

# Python 3 is required
from PIL import Image
import pytesseract

img = Image.open("T9esw.png")
img.load()
# Specify the language to recognize
text = pytesseract.image_to_string(img, lang="rus")
print(text)
i = 'Сред. Скорость'
print(i)
if text == i:
    print("Match")
else:
    print("Not Match")

Which outputs:

Фред скорасть
Сред. Скорость
Not Match

This means the words didn't quite match, but considering the minimal coding effort and the awful quality of the input image, I think the performance is quite amazing. In any case, the example shows that encoding and decoding should no longer be a problem.
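
Since OCR output is rarely character-perfect, an exact `==` may be too strict. A possible alternative, sketched here with the standard library's `difflib`, is to normalize both strings and compare a similarity ratio against a threshold. The helper names and the `0.8` threshold are my own assumptions and not part of pytesseract; tune the threshold for your images.

```python
import difflib
import unicodedata


def normalize(s):
    """NFC-normalize, case-fold and collapse whitespace (drops trailing newlines too)."""
    s = unicodedata.normalize("NFC", s).casefold()
    return " ".join(s.split())


def similarity(a, b):
    """Return a ratio between 0.0 (nothing in common) and 1.0 (identical after normalizing)."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_close(a, b, threshold=0.8):
    """Treat two strings as a match when they are similar enough (threshold is arbitrary)."""
    return similarity(a, b) >= threshold
```

With this, `is_close(text, i)` tolerates a few misread characters, while strings that differ only in case or surrounding whitespace get a ratio of 1.0.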

edited May 21 '18 at 08:20

answered Jun 22 '17 at 07:58

Marjan Moderc


This is what I am facing right now: the text does not match. I suspect that either pytesseract is not doing the OCR properly or we need to tell pytesseract to extract Russian text. – Deepan Raj Jun 22 '17 at 08:55
Well, this is unrelated to your original encoding/decoding issue. If the library is set up to recognize only ASCII characters and forces those characters, no Python-level encoding or decoding will help. – Marjan Moderc Jun 22 '17 at 09:14
I found that the language pack is not being used at all in the code, so it treats the Russian text as English and tries to match it. I tried specifying the language in the code like " text = (pytesseract.image_to_string(img), lang='rus') ", but it throws an error like "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>" – Deepan Raj Jun 22 '17 at 13:47
This just reveals that you haven't switched to Python 3. Above you can find a refined answer that solves your problem. As said previously, make sure that you use Python 3. It is very easy to switch with Anaconda or another virtual env. – Marjan Moderc Jun 22 '17 at 14:33
I switched to Python 3 and I am getting the same problem: the text doesn't match. – Deepan Raj Jun 22 '17 at 17:13
What is your question again? Text not matching and a Unicode error are two totally different things... I don't think the library's performance itself (= text not matching) is a subject to be discussed on Stack Overflow... Have you tried any other images? – Marjan Moderc Jun 22 '17 at 18:28
I tried different images and different languages. For English there is no problem, but for other languages I still face the problem above. – Deepan Raj Jun 23 '17 at 05:26
Marjan Moderc, can you please mention which Python version and Tesseract-OCR version you used? I tried Python 3.6.1 and Tesseract-OCR 3.05.01. The extracted text looks like "???? ???????" even though I specified the language rus. – Deepan Raj Jun 23 '17 at 07:27
It should work with the versions you provided. Are you printing this to the command line? Have you installed the Russian language pack for Tesseract (just saying rus in Python is not enough)? Which IDE are you using? – Marjan Moderc Jun 24 '17 at 10:13
Yes, I have included the tessdata in Tesseract-OCR. I tried printing to the command prompt and also storing the string in Notepad and Excel, but the result is the same everywhere. Since I am using Python 3.6, the Unicode error no longer appears, but it prints only question marks instead of characters. – Deepan Raj Jun 25 '17 at 14:53
Works here on Linux (Ubuntu 16.04), Python 3.5.2, pytesseract 0.1.7, tesseract 3.04.01. Printing in the system terminal and in the PyCharm (IDE) terminal works just fine... – Marjan Moderc Jun 26 '17 at 07:25
Moderc, your suggestion worked. I was working on a Windows machine. When I tried on Ubuntu, it worked perfectly. I need to figure out what is wrong on Windows. Thank you. Please let me know if you have tried it on Windows. – Deepan Raj Jun 26 '17 at 14:13

It all has to do with the system encoding. Tesseract computes everything fine, but Windows can't display it properly. Check which encoding your Windows console or IDE uses and try to set it to UTF-8, and you should be good to go on Windows too! – Marjan Moderc Jun 26 '17 at 15:18
OK, I will try that. Thank you – Deepan Raj Jun 27 '17 at 12:58
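
For anyone hitting the same question marks on Windows, here is a rough sketch of how one might separate "the OCR result is wrong" from "the console cannot display it" on Python 3. The helper names are invented for illustration. The idea is to inspect the string through an ASCII-only escape form and to write it to a file with an explicit UTF-8 encoding rather than relying on the platform default.

```python
import sys


def describe(text):
    """Return an ASCII-only form of the string that exposes its code points."""
    return ascii(text)


def save_utf8(path, text):
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_utf8(path):
    """Read the file back using the same explicit encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def force_utf8_stdout():
    """Switch stdout to UTF-8 where supported (Python 3.7+); return the encoding in use."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
    return sys.stdout.encoding
```

If `describe(text)` shows Cyrillic escapes such as `\u0421`, the string itself is fine and only the display is at fault. If it shows literal `?` characters, the text was already lost before printing.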
