ImageMagick best improvements for number readability (with Tesseract)

Question

I am using ImageMagick to convert a digitized PDF file to TIFF, and I use Tesseract to read a small part of this document, which is a number. My digitized documents are of poor quality, and sometimes Tesseract doesn't manage to read the right number. For example, it reads 5550002845 for the number you can see in the picture.

[Image: the scanned number as extracted from the PDF]

This picture was extracted from the PDF with the following command:

convert -quality 100 -density 300 temp.pdf -depth 8 -colorspace gray +matte +contrast +contrast temp.tiff

Is there anything else I can do to improve the image quality, and therefore the Tesseract detection?

Regards

score 1 · Answer 1 · answered Dec 20 '13 at 09:48


Adding `-noise 7` to the convert command did the trick for this one.
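As a rough sketch of how one might compare the `-noise` radii reported in this thread (3 and 7), the script below appends `-noise` to the question's convert command and then runs Tesseract on each result. The position of `-noise` in the command, the helper names `run` and `ocr_with_noise`, the `noise_<radius>` output names, and the `DRY_RUN` switch are all assumptions made for illustration; the Tesseract flags are the ones mentioned in the comments. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick's `convert` and `tesseract` are on PATH.
# DRY_RUN (hypothetical switch) defaults to 1, so commands are printed, not run.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # show the command that would be executed
    else
        "$@"           # execute it for real
    fi
}

# ocr_with_noise RADIUS PDF
# Converts PDF to TIFF using the question's options plus -noise RADIUS,
# then runs Tesseract on the TIFF. Output names are illustrative.
ocr_with_noise() {
    radius=$1
    pdf=$2
    out="noise_${radius}"
    run convert -quality 100 -density 300 "$pdf" -depth 8 -colorspace gray \
        +matte +contrast +contrast -noise "$radius" "${out}.tiff"
    run tesseract "${out}.tiff" "$out" --psm 1 --oem 1
}

# Try both radii reported in this thread on the question's input file.
for radius in 3 7; do
    ocr_with_noise "$radius" temp.pdf
done
```

Running it with `DRY_RUN=0` executes the commands; Tesseract then writes its text to `noise_3.txt` and `noise_7.txt`, which can be compared against the expected number to pick a radius for a given batch of scans.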


Vincent Roye


`-noise 3` is giving me the best results so far when converting PDF to TIFF before doing OCR with Tesseract, but it's a great tip: without any `-noise`, Tesseract really has a hard time with greyish numbers! (I usually run it this way: `tesseract --psm 1 --oem 1 out.tiff tout`) – tent Jan 20 '22 at 22:15
