
I want to recognize Hindi text from an image using the pytesseract library.

What I tried

The following script recognizes text, but the output is not in Hindi. It only produces characters typical of European / American languages:

# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract


pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
#im = Image.open("/tesserocr/hindisample.png")

#im = Image.open("C:/shubhamprojectwork/ocr/tesseract-python-master/sample1.jpg")
im = Image.open("C:/shubhamprojectwork/ocr/tesseract-python-master/hindisample.png")


# Run OCR with the Hindi language model (requires hin.traineddata)
text = pytesseract.image_to_string(im, lang='hin')

print(len(text))

# Write the recognized text to a UTF-8 encoded file
with open('bla.txt', mode='w', encoding='utf-8') as f:
    f.write(text)

# Read the first line back to check what was written
with open('bla.txt', mode='r', encoding='utf-8') as file1:
    print("Output of Readline function is ")
    print(file1.readline())

The image I want to extract text from is hindisample.png.

It generates the following text:

Wfififirifilfiafiiaflmtfimfi

WWfiRWWEIB-‘E

fiafiimfiifimfiafitw

fifiéfififimfiafiamfifiw
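
A quick way to confirm that output like the above contains no Hindi script at all is to count the characters that fall in the Devanagari Unicode block (U+0900 to U+097F). The sketch below is only illustrative; the helper name `devanagari_ratio` is hypothetical and not part of pytesseract.

```python
def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters in the
    Devanagari Unicode block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)
```

A ratio of 0.0 for the OCR result suggests that Tesseract did not use a Hindi model, even though `lang='hin'` was passed.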
Martin Thoma
Shubham Sharma
  • `text = pytesseract.image_to_string(im, lang = 'eng')` means that the eng (English) traineddata will be used. If you want to extract Hindi, download the Hindi traineddata from the tesseract repository (matching the tesseract version you're using) and change 'eng' to 'hin' – Dmitrii Z. May 07 '18 at 07:48
  • Many thanks @DmitriiZ., but that gives exactly the same results – Shubham Sharma May 07 '18 at 11:29

1 Answer


You might not have the Hindi traineddata installed. On Debian/Ubuntu, try installing the Hindi language pack with this command: `sudo apt-get install tesseract-ocr-hin`
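
To check whether the Hindi data is actually visible to Tesseract, you can ask the command-line tool for its installed languages with `tesseract --list-langs`. The following is a minimal sketch; the helper names `parse_list_langs` and `installed_languages` are made up for illustration, and it assumes the listing has the usual form of a header line ending in a colon followed by one language code per line.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header looks like 'List of available languages (2):',
    # so any line ending in ':' is skipped.
    return [line for line in lines if not line.endswith(":")]


def installed_languages(tesseract_cmd="tesseract"):
    """Run `tesseract --list-langs` and return the reported language codes."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the list to stderr rather than stdout, so both
    # streams are parsed; stray warning lines may need filtering.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

For example, `'hin' in installed_languages('C:/Program Files (x86)/Tesseract-OCR/tesseract')` should be `True` once `hin.traineddata` is in place. If it is `False`, the language data is not where Tesseract looks for it.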

Karol Żak