(Here is the noise-removed image that I am trying to extract text from.) I am trying to detect the text in an image (a jpg file) using Tesseract-OCR and OpenCV in Python. The text in the image is Turkish, so I am using the Turkish trained data ('tur') that is in the Tesseract-OCR folder. I applied dilation and erosion to remove the noise before running Tesseract.

The problem is that, even though some of the characters in particular areas are detected, the recognition is mostly unsuccessful and fails on Turkish characters. Do you know any method, or do you have any suggestions, for getting better results? Here is my code:

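import cv2
import pytesseract
from PIL import Image

# Point pytesseract at the Tesseract executable (one unbroken string).
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# Raw string: otherwise '\U' in the Windows path is a syntax error in Python 3.
img_cv = cv2.imread(r'C:\Users\gulsa\Desktop\Tesseract-OCR\alm98_2.jpg')  # OpenCV copy, for preprocessing
img = Image.open('alm98_2.jpg')  # PIL copy, passed to Tesseract

# Run OCR with the Turkish trained data.
tex = pytesseract.image_to_string(img, lang='tur')
print(tex)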

Thank you in advance!

Gülşah Ayhan
  • Have you tried the things listed in the "Improve quality" section of the Tesseract FAQ? (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) – Dmitrii Z. Nov 06 '17 at 14:05
  • It depends on the image. Is the text handwritten? Is there noise in the image? Is the lighting good? – zindarod Nov 06 '17 at 14:06
  • I have applied binarization, dilation and erosion to remove noise, but the result is the same. The text is not handwritten; it is printed and legible, with clear white characters on a black background. – Gülşah Ayhan Nov 06 '17 at 14:21
  • Can you post your image after you applied all your preprocessing (binarization, dilation etc.)? Also, by detection do you mean that Tesseract doesn't recognize a character as Turkish (but recognizes it as some other character), or that it doesn't see anything at all? – Dmitrii Z. Nov 06 '17 at 14:49
  • I have attached my image. It can see the characters, but it does not recognize most of them correctly. The output is full of mistakes. – Gülşah Ayhan Nov 06 '17 at 15:20

1 Answer

Here's what I get after running Tesseract on your image:

HerTürdenErutikyıdeplç'nTıkla!Sımsıkainlemereoyo AnındaCebirıdenIde!Iziemeklçin18YaşındanBüyükoin'ak Zorunludur.HerkamgoridenyüzleroevideoHighDefTvde!High DefTv,abonelik"servistir.Pakelhaîlaliktümvergilerdahilolamk ayda64TLyebtaIedimedig'süreoeherz—ıyyenileneoekîir.Servis ücreti,aboneoldugınuzoperaîöfündüzenleyecegifaîuralar karaliylaveyaönödemelihatlardanTL/Krmikîaridüsülerekîahsil edilecektir.Ipîaliğn:|PTALya24329z-ıgörder.Iptaledilendönem içinücretiadasiyapiin'azXeteriibakiyenizyokayükleme

So far, this doesn't look like a very bad result. I'm not saying it's a very good one, but the errors have nothing to do with Turkish letters specifically. You can get much better results if you can detect and separate the letters that are currently too close to each other.

[Image: example text in a clearer font with more space between characters]

For example, for this image I get perfect results (notice the better font and the extra space between characters):

Her Türden Erotik Video Için Tıkla!Sımsicak Binlerce Videoyu

If you're getting a lot of noisy characters that are definitely not in the Turkish alphabet (such as the fl or î symbols), you can make a blacklist.
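A blacklist can be passed to Tesseract through its `tessedit_char_blacklist` config variable, which pytesseract forwards via the `config` argument. The following is only a sketch: the helper names and the two blacklisted glyphs are assumptions chosen for illustration, and whether the blacklist is honored depends on your Tesseract version and OCR engine (as far as I know, the LSTM engine in Tesseract 4.0 ignores it).

```python
def build_blacklist_config(chars):
    """Return a Tesseract config string that blacklists the given characters."""
    return "-c tessedit_char_blacklist=" + "".join(chars)


def ocr_with_blacklist(image_path, chars, lang="tur"):
    """Run Tesseract on image_path while asking it to avoid the given characters."""
    # Imported lazily so the config helper works even without Tesseract installed.
    import pytesseract
    from PIL import Image

    config = build_blacklist_config(chars)
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)


# Hypothetical blacklist: glyphs seen in the noisy output that are not Turkish letters.
UNWANTED = ["î", "ﬂ"]
print(build_blacklist_config(UNWANTED))
```

Calling `ocr_with_blacklist('alm98_2.jpg', UNWANTED)` would then run the same OCR as in the question, with those glyphs excluded where the engine supports it.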

Another option is to iterate through the Tesseract results character by character and correct them, if you have a heuristic you can use for that.

Edit: To be honest, when I try to read the text in your image I cannot separate the words in a sentence myself. Maybe it is specific to the font you're using, but it definitely looks too hard to read for both human and machine.
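As a rough sketch of such a character-by-character pass: in the output above, î seems to appear where a t belongs ("faîuralar", "îahsil"), so one simple heuristic is a substitution table. The table and the function name below are assumptions for illustration only; a real table would need to be built from the errors you actually observe.

```python
# Hypothetical substitution table: a misrecognized glyph mapped to the
# character it most plausibly stands for in this particular output.
SUBSTITUTIONS = {
    "î": "t",   # e.g. "faîuralar" -> "faturalar"
    "ﬂ": "fl",  # expand the fl ligature into two plain letters
}


def correct_ocr_text(text, substitutions=SUBSTITUTIONS):
    """Walk the OCR output one character at a time and apply known fixes."""
    corrected = []
    for ch in text:
        # Keep the character unless the table has a replacement for it.
        corrected.append(substitutions.get(ch, ch))
    return "".join(corrected)


print(correct_ocr_text("faîuralar"))
```

This only fixes one-to-one confusions. Errors such as merged words or dropped letters would need a different approach, for example checking candidates against a dictionary.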

Edit2: Added example image with more space between chars

Dmitrii Z.
  • It is actually bad not only for Turkish characters; it fails on most other characters too. I will try your suggestions as well, thank you so much... – Gülşah Ayhan Nov 06 '17 at 15:48
  • I added an example image with a different font that has more space between characters, so you can see how the output quality improves. – Dmitrii Z. Nov 06 '17 at 16:10
  • Yes, the results are very good this time. Did you put the space between the characters by hand? I am looking for a method to change the font of the text in an image (i.e. to put more space between the characters). – Gülşah Ayhan Nov 06 '17 at 16:42