
After automatic deskew and crop I have the following image:

enter image description here

I need to OCR this image. Right now ABBYY Engine SDK 11 for Linux produces a rather poor result:

IMerasers - www,raiyirnieti'^C9,co;i,ni                                                          
Clariiis: Jv ocl'ca :PO 9ox 30998, S&M Luke C6y, UT 84":30                                       
Guslomei: Service:                                 952-945-800G or 800-952-3^55                  
Jieaf5ftg: impaired;                               VA                                            
Pharmaaisto:                                       853-364-6331                                  
Medica Pfovic.&s:                                  80 ;j-2i5S-55"',2 o ■ www.rfledica.cori       
^ofricai'or Services:                              86i-7<5-9920                                  
t1 ^edHoaiihca'Q Provicors; 6 77-842420 or                                                       
                                               ; mffiffiF********                               
Sviet iea Be tsvio a rieofift:                                                                  
Mocica Ca-linK frwso ,'ne: 430-962-9*9?    

What automatic image preprocessing techniques can be applied to this image to improve the OCR quality? Or is it impossible to improve the OCR quality for this image? Right now I use the OpenCV and Leptonica libraries to preprocess the images.

UPDATED

This is the original image:

enter image description here

alexanoid
  • Could you please post your original input image? What format was it? Was it a PDF, a JPG, or a binary compressed TIFF? If it was a PDF, you can process it at a much higher resolution, and that would help. – fmw42 Mar 01 '18 at 17:36
  • @fmw42 I have added the original input image – alexanoid Mar 01 '18 at 18:06
  • Was this scan truly jpg and at this very low quality? If so, I doubt you can improve your result. If the scan had been at a higher resolution or a PDF, then it might be improved. Can you rescan at higher density? – fmw42 Mar 01 '18 at 21:58
  • @fmw42 thanks! Yes, my input file is this JPG. What do you mean by rescan: rescanning from this file or from the original source? All I have is the JPG file shown above. Could you also explain in more detail the benefits of having the original scans as PDF? – alexanoid Mar 02 '18 at 06:43
  • Rescan from the original paper copy at a higher density. When scanning to PDF, you can set the density when you read the PDF and convert it to raster. This means you can get a higher-quality raster result later from the PDF. Either way, scanning at a higher density is best. Most scanners allow you to set the density when scanning. – fmw42 Mar 02 '18 at 17:29
  • Thanks for your answer! Unfortunately, I have no influence over how the documents are scanned and have to work with documents scanned by someone else. Sometimes I have to extract images from PDF files in order to preprocess and OCR them, so I am really grateful for the information about the density that can be set when reading a PDF and converting it to raster. I have to check how this can be implemented with Java tools such as PdfBox. – alexanoid Mar 02 '18 at 17:48
  • Best to experiment with a PDF scan. Sometimes it is better to extract the embedded image from the PDF. Try `convert -density 300 image.pdf result.png`, or set the density even higher if that works, and see if the result is any better. It is best not to save to JPG because of its lossy compression, so save to PNG or TIFF. – fmw42 Mar 02 '18 at 18:32

2 Answers


The image has been binarized at a relatively low resolution and with noise.

You can slightly improve it by

  • doubling or tripling the resolution (with or without bilinear interpolation, that makes little difference);

  • smoothing (small Gaussian filter, median...);

  • binarizing again.

But there is little that you can recover; the damage is done. Most probably, preprocessing will worsen the results.
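
The three steps listed above (upscale, smooth, binarize again) can be sketched roughly as follows. This is only a minimal pure-Python illustration, not a tuned pipeline: the function names are hypothetical, the image is assumed to be a grayscale list of rows with values 0 to 255, upscaling is done by plain pixel replication, and Otsu's method is my own choice for the re-binarization step since no particular threshold is specified here. With OpenCV, which the question already uses, the corresponding calls would be `cv2.resize`, `cv2.medianBlur` (or `cv2.GaussianBlur`) and `cv2.threshold` with `cv2.THRESH_OTSU`.

```python
from statistics import median


def upscale(img, factor):
    """Enlarge a grayscale image (list of rows of 0-255 ints) by pixel replication."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def median3x3(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        new_row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            new_row.append(int(median(window)))
        out.append(new_row)
    return out


def otsu_threshold(img):
    """Return the gray level that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the values of those pixels
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img, t):
    """Map pixels <= t to black (0) and everything else to white (255)."""
    return [[0 if p <= t else 255 for p in row] for row in img]


def preprocess(img, factor=2):
    """Upscale, smooth, then binarize again (hypothetical helper)."""
    smooth = median3x3(upscale(img, factor))
    return binarize(smooth, otsu_threshold(smooth))
```

Calling `preprocess(img, factor=2)` or `factor=3` corresponds to doubling or tripling the resolution. The median filter removes isolated specks that are small relative to the 3x3 window, but it cannot restore strokes that were already lost in the first binarization, so the caveat above still applies.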

enter image description here


As Yves says, the quality of the image is quite low. Nevertheless you should be able to improve your results:

  • try resizing the image; some OCR engines expect letters of specific dimensions
  • try another OCR engine, such as Tesseract
  • if you have to read many documents with the same font, you can train the OCR engine on that font
user2518618
  • IMO, resizing will just increase the damage. Some of the characters are irreparably altered. Training with characters obtained in the same conditions is a good idea. –  Mar 01 '18 at 17:31
  • Thanks, I have to figure out whether it is possible to train ABBYY Engine SDK on specific fonts and, if so, how that will affect the rest of the OCR process for other documents. – alexanoid Mar 01 '18 at 18:25