Improve the OCR quality of a low-quality scanned image

Question

After automatic deskew and crop I have the following image:

I need to OCR this image. Right now ABBYY Engine SDK 11 for Linux does not produce a very good result:

IMerasers - www,raiyirnieti'^C9,co;i,ni                                                          
Clariiis: Jv ocl'ca :PO 9ox 30998, S&M Luke C6y, UT 84":30                                       
Guslomei: Service:                                 952-945-800G or 800-952-3^55                  
Jieaf5ftg: impaired;                               VA                                            
Pharmaaisto:                                       853-364-6331                                  
Medica Pfovic.&s:                                  80 ;j-2i5S-55"',2 o ■ www.rfledica.cori       
^ofricai'or Services:                              86i-7<5-9920                                  
t1 ^edHoaiihca'Q Provicors; 6 77-842420 or                                                       
                                               ; mffiffiF********                               
Sviet iea Be tsvio a rieofift:                                                                  
Mocica Ca-linK frwso ,'ne: 430-962-9*9?

Which automatic image preprocessing techniques can be applied to this image to improve the OCR quality? Or is it impossible to improve the OCR quality for this image? Right now I use the OpenCV and Leptonica libraries to preprocess the images.

UPDATED

This is the original image:

Please post your original input image. What format was it? Was it a PDF, a JPG, or a binary compressed TIFF? If it was a PDF, you can process it at a much higher resolution, and that would help. – fmw42, Mar 01 '18 at 17:36
Was this scan truly a JPG at this very low quality? If so, I doubt you can improve your result. If the scan had been made at a higher resolution or as a PDF, then it might be improved. Can you rescan at a higher density? – fmw42, Mar 01 '18 at 21:58
@fmw42 Thanks! Yes, my input file is this JPG. What do you mean by rescan: rescan from this file or from the original source? The presented JPG file is all I have. Could you also explain in more detail the benefits of having the original scans as PDF? – alexanoid, Mar 02 '18 at 06:43
Rescan from the original paper copy at a higher density. When scanning as a PDF, you can set the density when you read the PDF and convert it to raster. This means you can get a higher-quality raster result later from the PDF. Either way, scanning at a higher density is best. Most scanners allow you to set the density when scanning. – fmw42, Mar 02 '18 at 17:29
Thanks for your answer! Unfortunately, I have no influence over how the documents are scanned and just have to work with documents scanned by someone else. Sometimes I have to extract images from PDF files in order to preprocess and OCR them, so I am really grateful for the information about the density that can be set when reading a PDF and converting it to raster. I have to check how this can be implemented with Java tools such as PdfBox. – alexanoid, Mar 02 '18 at 17:48
Best to experiment with a PDF scan. Sometimes it is better to extract the embedded image from the PDF. Try `convert -density 300 image.pdf result.png`, or set the density even higher if that works and see whether it is any better. It is best not to save to JPG because of its lossy compression, so save to PNG or TIFF. – fmw42, Mar 02 '18 at 18:32

score 2 · Answer 1 · 2018-03-01T16:44:04.563

The image has been binarized at a relatively low resolution and with noise.

You can slightly improve it by

doubling or tripling the resolution (with or without bilinear interpolation; it makes little difference);
smoothing (small Gaussian filter, median...);
binarizing again.
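
The three steps above can be sketched in plain Python on a grayscale image stored as a list of rows of 0..255 integers. This is only an illustration under assumptions that are not part of the answer: the upscaling is nearest-neighbour (the "without interpolation" variant), the smoothing is a 3x3 median, and the new threshold is picked with Otsu's method, which the answer does not name. All function names are made up for this sketch. With OpenCV, `cv2.resize`, `cv2.medianBlur` and `cv2.threshold` with the `cv2.THRESH_OTSU` flag would cover the same steps.

```python
from statistics import median


def upscale(img, factor):
    """Nearest-neighbour upscaling: repeat every pixel `factor` times in x and y."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def median3x3(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(img):
    """Return the level t that maximizes the between-class variance
    (class 0 = pixels <= t, class 1 = pixels > t)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no split at this level
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img, t):
    """Pixels above t become white (255), the rest black (0)."""
    return [[255 if p > t else 0 for p in row] for row in img]


def preprocess(img, factor=2):
    """Upscale, smooth, then binarize again (assumed parameter choices)."""
    smooth = median3x3(upscale(img, factor))
    return binarize(smooth, otsu_threshold(smooth))
```

On a synthetic input, an isolated one-pixel speck is removed while a stroke two pixels wide survives unchanged. Strokes only one pixel wide are erased in the same way as the speck, which is one way this kind of preprocessing can make a badly damaged scan worse.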

But there is little that you can recover; the damage is done. Most probably, preprocessing will worsen the results.

score 0 · Answer 2 · answered Mar 01 '18 at 17:04

As Yves says, the quality of the image is quite low. Nevertheless, you should be able to improve your results:

try resizing the image; some OCR engines expect letters of specific dimensions
try another OCR engine, such as Tesseract
if you have to read many documents with the same font, you can train the OCR engine on that font

user2518618


IMO, resizing will just increase the damage. Some of the characters are irreparably altered. Training with characters obtained in the same conditions is a good idea. – Mar 01 '18 at 17:31
Thanks. I have to figure out whether it is possible to train ABBYY Engine SDK on specific fonts and, if so, how that would affect the rest of the OCR process for other documents. – alexanoid Mar 01 '18 at 18:25
