OCR JPEG file to text

Question

I am trying to convert the attached JPEG file to text using OCR. When I use pytesseract or tesseract, the output contains diacritics and a lot of junk characters, so the JPEG-to-text conversion is effectively not working.

I tried to read the image file, extract the text, and type it out as keystrokes. The output is not what I expected.

The code is as follows:

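from PIL import Image
from pytesseract import image_to_string
import keyboard
image = Image.open('8001.jpg')
text = image_to_string(image, lang='eng')  # run Tesseract with the English model
keyboard.write(text)  # type the recognised text as keystrokes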

I am getting some unwanted characters like these:

>) ) 7? ) 7 0 Daybreak: appeared. Ihe mowing miosls ourvounded us, bub Urey 2001 cleared ch J Wea

> pm 0. 0 ) ) aeaboul lo examine the hull, which formed on deely a kind of horizontal 2

fatfoun, w fen a J felt ils op nel, kicking the resounding plate. “Open,

) me " 57 gradually sinking. Oh! confound i! cried Nod

0 Q yi you inhoapitable zasealy!

Says Pp iy ui

0 0 cide, came from the interior of the Boal. One iton plate was moved, a men appeared, ullered
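Part of the junk in output like the above is isolated symbols, digits, and single letters. A minimal sketch of a post-OCR filter follows, assuming it is acceptable to drop every token that has no run of two or more letters (apart from the words "a", "A", and "I"). The function name `strip_ocr_noise` is hypothetical, and the heuristic is crude: it also discards genuine numbers, and it cannot repair misrecognised words such as "inhoapitable".

```python
import re


def strip_ocr_noise(line):
    """Keep tokens that look like words; drop stray symbols and digits.

    Assumption: a token is "word-like" if it contains at least two
    consecutive ASCII letters, or is one of the one-letter words below.
    """
    one_letter_words = {"a", "A", "I"}
    kept = []
    for token in line.split():
        if re.search(r"[A-Za-z]{2,}", token) or token in one_letter_words:
            kept.append(token)
    return " ".join(kept)


def strip_ocr_noise_text(text):
    """Apply strip_ocr_noise to every line and drop lines left empty."""
    cleaned = (strip_ocr_noise(line) for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)
```

It would be applied to the string returned by `image_to_string` before the text is used any further.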

Link to my file https://www.plustransfer.com/download.php?id=2671aa1153f0402615141fd7f2f1011a — Eswar Rajan Subramanian, Jun 12 '19 at 13:17
Is JPEG your only option? JPEG is probably the worst format for text when the image resolution is low. If you can, try to get an image with less compression and a higher resolution. — Oddmar Dam, Jun 12 '19 at 16:19
Yes. As you mentioned, JPEG is the only option I have. Can I try converting the same image to PNG or another richer format? Is that possible? — Eswar Rajan Subramanian, Jun 12 '19 at 19:11
https://www.dropbox.com/s/knwzu0wupkahff1/8001.jpg?dl=0 This is the actual file, and this is the picture quality. Is there any way to extract the data? — Eswar Rajan Subramanian, Jun 12 '19 at 19:15
Eswar, no. Converting the JPG to something else will still leave you with the artifacts that are already in the JPEG file. — Oddmar Dam, Jun 12 '19 at 20:33
Hello Oddmar Dam. Oh! Then I don't have any other format; this is the only one available. — Eswar Rajan Subramanian, Jun 13 '19 at 02:55
Did you follow this example? https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/ — Oddmar Dam, Jun 13 '19 at 07:13
I tried that, but no luck. It finds only a very limited part of the text in that file, not the whole file. — Eswar Rajan Subramanian, Jun 13 '19 at 07:49
I would say you need some deep learning with a lot of training. The problem with the image is that the font is very difficult to OCR and the JPG quality is too low. — Oddmar Dam, Jun 13 '19 at 09:35
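As the comments note, saving the JPEG in another format does not remove its compression artifacts. What can be tried cheaply before any training is preprocessing the pixels themselves. Below is a minimal sketch using Pillow, assuming a grayscale conversion, a 2x upscale, and a fixed threshold of 128. The function names `preprocess_for_ocr` and `ocr_file` and both default values are illustrative assumptions, and there is no guarantee this is enough for a font as difficult as the one in 8001.jpg.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=128):
    """Return a grayscale, upscaled, black-and-white copy of `image`.

    `scale` and `threshold` are assumed starting values to tune by hand.
    """
    gray = ImageOps.grayscale(image)      # drop colour information
    gray = ImageOps.autocontrast(gray)    # stretch the contrast range
    new_size = (gray.width * scale, gray.height * scale)
    gray = gray.resize(new_size, Image.LANCZOS)  # enlarge small text
    # Map every pixel to pure black (0) or pure white (255).
    return gray.point(lambda p: 255 if p >= threshold else 0)


def ocr_file(path):
    """Preprocess the image at `path` and run Tesseract on the result."""
    # Imported here so the preprocessing step works without pytesseract.
    import pytesseract

    with Image.open(path) as img:
        cleaned = preprocess_for_ocr(img)
    return pytesseract.image_to_string(cleaned, lang="eng")
```

Calling `ocr_file('8001.jpg')` would replace the direct `image_to_string(image, lang='eng')` call from the question. Trying a few `threshold` values and inspecting the intermediate image by eye is usually more informative than reading the OCR output alone.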
