
I have thousands of multi-page PDFs, and each PDF has a different resolution (depending on the scanner used to scan it). I want to convert each page of each PDF to PNG so that I can pass it to Tesseract for OCR. I used ImageMagick to convert to PNG, but I have to pass a fixed DPI for all images to get readable output. Is there a way to convert each PDF while preserving that PDF's own resolution?

For example, if 1.PDF has resolution 622 × 788 and 2.pdf has resolution 792 × 612, I want an exact conversion with the same resolution, just in a different format (PNG).

The command I am using right now is:

convert -monochrome -density 1200 input.pdf -resize 25% -monochrome -white-threshold 50% -black-threshold -50% output.png

Thanks, pashah

emcconville
pashah
  • Please add the command you are using to convert as well. Only that way is it possible to obtain an answer that actually explains something. – rll Jun 24 '16 at 18:24
  • Sorry, I have updated the post with the command. – pashah Jun 24 '16 at 18:40

1 Answer


Perhaps read the geometry of the first page, then resize all pages to match?

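# Read the geometry of the first page only (identify rasterizes at 72 DPI by default)
SIZE=$(identify -format '%wx%h' 'input.pdf[0]')
# -density must precede the input file so it applies when the PDF is rasterized
convert  -density 1200 \
         input.pdf \
         -resize "$SIZE" \
         -monochrome \
         -white-threshold 50% \
         -black-threshold -50% \
         -append \
         output.png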
emcconville
  • Thanks @emcconville. However, this does not preserve the resolution. The output PNG image is degraded. – pashah Jun 27 '16 at 21:29
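
One likely reason for the degradation: identify reports the page size as rasterized at ImageMagick's default 72 DPI, so resizing a 1200 DPI render down to that geometry leaves roughly a 72 DPI image. A different approach, sketched below, is to look up the resolution of the scan embedded in each PDF and rasterize at that density with no resize step. This is only a sketch under assumptions: it swaps in Poppler's pdfimages tool (not used in the original post), it assumes that `pdfimages -list` prints two header lines with x-ppi in the 13th column, and the function names `max_ppi` and `pdf_to_png` and the 300 DPI fallback are made up for illustration.

```shell
#!/bin/sh
# Sketch: render each PDF at the resolution of its embedded scan.
# Assumes Poppler's `pdfimages` and ImageMagick's `convert` are installed.

# Read `pdfimages -list` output on stdin and print the largest x-ppi value.
# Assumption: two header lines, and x-ppi is the 13th whitespace-separated field.
max_ppi() {
    awk 'NR > 2 && $13 + 0 > max { max = $13 + 0 } END { print max + 0 }'
}

# Hypothetical helper: convert one PDF to one PNG per page at its own DPI.
pdf_to_png() {
    pdf=$1
    dpi=$(pdfimages -list "$pdf" | max_ppi)
    # Fall back to 300 DPI (an arbitrary choice) when no embedded image is found.
    [ "$dpi" -gt 0 ] || dpi=300
    # -density precedes the input so it applies when the PDF is rasterized;
    # %03d in the output name numbers the pages.
    convert -density "$dpi" "$pdf" "${pdf%.pdf}-%03d.png"
}

# Example usage over a directory of PDFs:
#   for f in *.pdf; do pdf_to_png "$f"; done
```

The thresholding options from the question (-monochrome, -white-threshold, -black-threshold) can be appended after the input file if bilevel output is still wanted for Tesseract.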