Not getting Hindi text from an image

Question

I want to recognize Hindi text from an image using the pytesseract library.

What I tried

The following script recognizes text overall, but the output is not in Hindi. It only produces characters typical of European / American text:

# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable.
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

im = Image.open("C:/shubhamprojectwork/ocr/tesseract-python-master/hindisample.png")

# Run OCR with the Hindi language data.
text = pytesseract.image_to_string(im, lang='hin')

print(len(text))

# Save the result as UTF-8, then read the first line back.
with open('bla.txt', mode='w', encoding='utf-8') as f:
    f.write(text)

with open('bla.txt', mode='r', encoding='utf-8') as file1:
    print("Output of Readline function is ")
    print(file1.readline())

The image I want to extract text from is here: [image]

It generates this text:

Wﬁﬁﬁriﬁlﬁaﬁiaﬂmtﬁmﬁ

WWﬁRWWEIB-‘E

ﬁaﬁimﬁiﬁmﬁaﬁtw

ﬁﬁéﬁﬁﬁmﬁaﬁamﬁﬁw

`text = pytesseract.image_to_string(im, lang = 'eng')` means that eng (english) traineddata would be used. If you want to extract hindi - download hindi traineddata from tesseract repository (according to tesseract version you're using) and change 'eng' to 'hin' — Dmitrii Z., May 07 '18 at 07:48
many thanks @DmitriiZ. but that is also giving exact same results — Shubham Sharma, May 07 '18 at 11:29

score 0 · Answer 1 · edited Aug 01 '20 at 09:10

You might not have the Hindi traineddata installed. On Debian/Ubuntu, try installing the Hindi language package with this command: sudo apt-get install tesseract-ocr-hin
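Before reinstalling anything, it may help to confirm whether Tesseract can actually see the Hindi data. The following is a minimal sketch, not a definitive check: it assumes the `tesseract` executable is on the PATH (or that you pass its full path), and the helper names `parse_langs` and `installed_langs` are made up for this illustration. It runs `tesseract --list-langs` and looks for `hin` in the result.

```python
import shutil
import subprocess


def parse_langs(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs(tesseract_cmd="tesseract"):
    """Return the languages the given binary reports, or [] if it cannot be found."""
    if shutil.which(tesseract_cmd) is None:
        return []
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"], capture_output=True, text=True
    )
    # Look at both streams in case the list is written to stderr.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    print("hin installed:", "hin" in langs)
```

If `hin` is missing from the list, `lang='hin'` cannot work until `hin.traineddata` is placed in the tessdata directory of the Tesseract installation that pytesseract is pointed at.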

edited Aug 01 '20 at 09:10 by Karol Żak

answered Jul 30 '20 at 18:57 by harish v
