I created a searchable PDF file by running the following command on one of my images:

tesseract page.jpg test pdf --oem 1 --psm 5 -l urd

This is the image that I converted to a searchable PDF: [image]

The image contains Urdu text, but when I copy the text from the newly created PDF file and paste it into any text editor, this is what I get:

GehbFie”

Can any Tesseract OCR and encoding expert help me solve this issue? Any help will be highly appreciated. Thanks in advance.

  • Have you tried LibreOffice Writer, Microsoft WordPad, or Microsoft Word? – lit Oct 04 '18 at 15:55
  • Of course. I tried many different editors (Sublime, Notepad, Notepad++, MS Word, WordPad), but the result is the same in every editor. I think there is an encoding problem. – Muhammad Moinuddin Oct 05 '18 at 14:58
  • Very good, brother. I am happy you are trying to OCR Urdu. What is your progress? Have you tried the Google Vision API OCR (Urdu included)? https://cloud.google.com/vision/ – MindRoasterMir Feb 04 '19 at 14:52

1 Answer

pdf is the name of a config file. It needs to come last in the command, after options such as --oem, --psm, and -l.

The correct form of the command is the following:

tesseract page.jpg test --oem 1 --psm 5 -l urd pdf

This resolved my issue.
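
If you call tesseract from a script, the same ordering rule can be built into a small helper so that the config file name always ends up last. This is only a rough sketch: the function names build_tesseract_command and run_ocr are made up for illustration, and the defaults simply mirror the command above.

```python
import subprocess


def build_tesseract_command(image, output_base, lang="urd", oem=1, psm=5, config="pdf"):
    """Return the argument list with all options first and the config file name last."""
    return [
        "tesseract", image, output_base,
        "--oem", str(oem),
        "--psm", str(psm),
        "-l", lang,
        config,  # config file name goes at the very end
    ]


def run_ocr(image, output_base, **options):
    """Run tesseract (hypothetical helper; needs tesseract and the language data installed)."""
    command = build_tesseract_command(image, output_base, **options)
    subprocess.run(command, check=True)
```

Calling run_ocr("page.jpg", "test") would then run the corrected command shown above.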

  • Excellent, brother, I was very happy to hear that. How much success have you had with this work so far? I am using Google's Vision OCR and it is excellent. I have not seen any other OCR like it for Urdu. How has your experience been? Please do share. Jazak Allah – MindRoasterMir Feb 04 '19 at 14:54