Text changes when I copy it from a searchable PDF file (created with the tesseract command) and paste it into Notepad

Question

I created a searchable PDF file by running the following command on one of my images.

tesseract page.jpg test pdf --oem 1 --psm 5 -l urd

This is the image that I converted to a searchable PDF.

The image contains Urdu text, but when I copy the text from the newly created PDF file and paste it into any other text editor, this is what I get.

GehbFie”

Can any Tesseract OCR and encoding expert here help me solve this issue? Any help will be highly appreciated. Thanks in advance.

Have you tried LibreOffice Writer, Microsoft WordPad, or Microsoft Word? — lit, Oct 04 '18 at 15:55
Of course, I tried a lot of different editors (Sublime, Notepad, Notepad++, MS Word, WordPad), but the result is the same in every editor. I think there is an encoding problem. — Muhammad Moinuddin, Oct 05 '18 at 14:58
Very good, brother. I am happy you are trying to OCR Urdu. What is your progress? Have you tried the Google Vision API OCR (Urdu included)? https://cloud.google.com/vision/ — MindRoasterMir, Feb 04 '19 at 14:52

score 1 · Accepted Answer · answered Oct 16 '18 at 15:40


pdf is the name of a config file. It needs to come last in the command, after the options (--oem, --psm, -l, etc.).

The correct format for the command is the following.

tesseract page.jpg test --oem 1 --psm 5 -l urd pdf

I resolved my issue this way.
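To make the argument order harder to get wrong, the command can be wrapped in a small shell function. This is only a minimal sketch: build_tesseract_cmd is a hypothetical helper name (not part of tesseract), and the --oem 1 and --psm 5 values are simply the ones from the question. It prints the command instead of running it, so it can be checked without tesseract installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): prints a tesseract command
# with the options first and the config file name ("pdf") last.
build_tesseract_cmd() {
    image=$1      # input image, e.g. page.jpg
    outbase=$2    # output base name, e.g. test
    lang=$3       # language code, e.g. urd
    # --oem 1 and --psm 5 are taken from the question; adjust as needed.
    printf 'tesseract %s %s --oem 1 --psm 5 -l %s pdf\n' "$image" "$outbase" "$lang"
}

# Print the command for the example from the question.
build_tesseract_cmd page.jpg test urd
```

Once the printed command looks right, it can be run directly in the shell.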


Muhammad Moinuddin


Excellent, brother, I am very happy to hear this. How far have you got with this work so far? I am using Google's Vision OCR and it is excellent. I have not seen any other OCR like it for Urdu. Please do share your experience. JazakAllah. – MindRoasterMir Feb 04 '19 at 14:54
