
I am using ImageMagick to convert a digitized PDF file to TIFF, and then Tesseract to read a small part of the document that contains a number. My digitized documents have poor resolution, and sometimes Tesseract fails to read the number correctly. For example, it reads 5550002845 for the number you can see in the picture.

enter image description here

This picture was extracted from the PDF with the following command:

convert -quality 100 -density 300 temp.pdf -depth 8 -colorspace gray +matte +contrast +contrast temp.tiff

Is there anything better I can do to improve the image quality, and with it the Tesseract detection?

Regards

Vincent Roye

1 Answer


Adding `-noise 7` to the `convert` command did the trick for this one.
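
For reference, here is a minimal sketch of what the full command might look like with the option added. In ImageMagick, `-noise` followed by a radius applies a noise-reduction filter (whereas `+noise` adds noise). The helper name `pdf_to_tiff`, the `DRY_RUN` switch, and the position of `-noise` after the other operators are assumptions made for illustration; only the radius of 7 comes from this answer.

```shell
#!/bin/sh
# Sketch: the question's conversion with a noise-reduction pass added.
# pdf_to_tiff is a hypothetical helper name. The default radius of 7 is the
# value that worked for this scan and may need tuning for other documents.
pdf_to_tiff() {
    input=$1
    output=$2
    radius=${3:-7}

    # Same options as the original command, plus -noise <radius>.
    # Placing -noise after the contrast operators is an assumption.
    set -- convert -quality 100 -density 300 "$input" \
        -depth 8 -colorspace gray +matte +contrast +contrast \
        -noise "$radius" "$output"

    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `pdf_to_tiff temp.pdf temp.tiff` runs the conversion, and `DRY_RUN=1 pdf_to_tiff temp.pdf temp.tiff 3` only prints the command with a radius of 3. It may also be worth checking the contrast step: in ImageMagick, `+contrast` reduces contrast while `-contrast` enhances it.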

Vincent Roye
  • `-noise 3` is giving me the best results so far when converting PDF to TIFF before doing OCR with Tesseract. It's a great tip, though: without any `-noise`, Tesseract really has a hard time with greyish numbers! (I usually run it this way: `tesseract --psm 1 --oem 1 out.tiff tout`) – tent Jan 20 '22 at 22:15
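
Since the best radius seems to differ between scans (7 in the answer, 3 in the comment), one way to choose is to try a few values and compare what Tesseract reads. The following is only a sketch: `ocr_sweep` and the `out-noise-N.tiff` file names are hypothetical, the `convert` options are a trimmed version of the question's command, and the `--psm 1 --oem 1` settings are taken from the comment above.

```shell
#!/bin/sh
# Sketch: convert the PDF once per noise radius and print the OCR result.
# ocr_sweep is a hypothetical helper; output file names are examples only.
ocr_sweep() {
    pdf=$1
    shift
    for radius in "$@"; do
        tiff="out-noise-$radius.tiff"
        convert -density 300 "$pdf" -depth 8 -colorspace gray +matte \
            -noise "$radius" "$tiff" || return 1
        # Label each result with the radius that produced it.
        printf 'noise=%s: ' "$radius"
        # Using "stdout" as the output base makes tesseract print the text.
        tesseract "$tiff" stdout --psm 1 --oem 1
    done
}
```

For example, `ocr_sweep temp.pdf 1 3 5 7` prints one labeled OCR result per radius, so the value that recovers the expected number can be picked by eye.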