I have a bunch of images that I want to convert into a single PDF. The images are primarily images of text (similar to scanned pages of a textbook). The image files are extremely large, and I have no need for the resolution they offer.

So first, as a baseline, I did a simple conversion of 26 of these "pages" to a single PDF; the total file size was 46MB. Viewing in page-width mode resulted in a scale of 16% of the original image.

convert *.png kapittel1.pdf

The quality of the PDF pages was perfect; the file was just too large. So I figured that since 16% of the image is more than adequate for viewing the entire width of the page on my screen, I could reduce the images to 20% of their original dimensions and still maintain the same visible quality. However, after running the command below, the quality of the images is visibly worse than before reducing the size.

convert -resize 20% -quality 100% *.png 20percent.pdf

I believe I'm going to need to start looking into filters, but before I potentially waste my time converting using all of the filters then comparing to find the one I want to use, is there a better way to just reduce the size, maintain quality, then convert to PDF? I don't see why I would be losing pixels here.

Edit

I tried with -scale instead of -resize but am really not seeing a difference in the output. It pretty much seems that once I go below 40% I start losing pixel data.

Jens Bodal
  • In the future, try to scan text at 1:1 in grayscale at 300DPI (if you want to OCR it); that gets me the best results. I've found that it always works best, afterward, to use Adobe to downsample and compress the images (via document processing) and then OCR it using "Clearscan," which increases the quality of the font. I know that doesn't address ImageMagick exactly, but it's become my default workflow for scanning documents. – Shawn Patrick Rice Oct 12 '14 at 19:02
  • Thank you for the suggestion. At the moment I pretty much only have the image files, and worst case scenario I just have to deal with the extremely large PDF files (~20x46MB). I'm guessing the Adobe stuff you're referring to requires Adobe Acrobat, which I don't have immediate access to. Though I think similar to what you had said, the image files I have are extremely high quality and all of the data should be there, I just want them to be much smaller but of the same quality. – Jens Bodal Oct 12 '14 at 19:08
  • You're right. I did mean Acrobat. While there are other tools available, I've gone with Acrobat for the reasons mentioned above. It doesn't have the best OCR engine (ABBYY Finereader has the best), but the Clearscan function is what has always won me over for making scanned PDFs more readable. Are you using Linux, OS X, or Windows? There might be other options as well that I could refer you to. – Shawn Patrick Rice Oct 12 '14 at 19:14
  • I'm on OS X right now, but I'm extremely OS agnostic when it comes to finding a solution that will work for this. I first went to ImageMagick on Unix since I thought that was the de facto standard, but I know I could do this manually using IrfanView on Windows and then printing to PDF. Even if I were using Acrobat, how complicated is the process you mention? I could probably just download a trial of it for this job if it's definitely going to work. – Jens Bodal Oct 12 '14 at 19:24
  • It's fairly simple. You'd just create a new PDF from the documents (`Combine Files into PDF`), then use Document Processing -> Optimize Scanner PDF (wait for a while for that to finish), then Text Recognition -> In this file and play with the settings (make sure you use Clearscan), and that's it. You might have to enable the tools to make them appear (I forget how off the top of my head). But you can play with the settings in each to see what happens. While you can combine the last two steps, I find I get better results separating them. – Shawn Patrick Rice Oct 12 '14 at 19:52
  • Great thanks for the tip. I'm going to do the crappy way which kinda works using imagemagick for now, but if I have time next week I'll get Adobe going and try it that way. Atm I can get 25% of the original size with ImageMagick, but I should be able to get way smaller than that. – Jens Bodal Oct 12 '14 at 21:50
  • Can you maybe post a link to one of your PNG files? – Mark Setchell Oct 13 '14 at 09:15
  • Hi Mark, unfortunately I am pretty sure I shouldn't, so I am not going to. I tried looking on Google image search for something similar, a high resolution PNG with lots of text, but couldn't really find something. For this I am just going to use a trial of Adobe Acrobat and do it that way this time. – Jens Bodal Oct 14 '14 at 20:03

2 Answers

The excellent ImageMagick Examples state that, by default, no image compression is used when creating PDFs, and suggest using Zip (Deflate) compression:

convert *.png -compress Zip -quality 100 kapittel1.pdf

If your images are only black and white, you can try the -monochrome option and optionally Group4 (Fax) compression using -compress Group4.
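
As a rough sketch of the bilevel option mentioned above, the two flags can be combined in one call. The function name `bilevel_pdf` and the default output name are hypothetical, chosen only for illustration, and the result is only useful if the pages really are pure black and white:

```shell
# Hypothetical helper: bundle every PNG in the current directory into a
# bilevel (1-bit) PDF using Group4 (fax) compression.
# Assumes ImageMagick's `convert` is on the PATH.
bilevel_pdf() {
    out=${1:-bilevel.pdf}
    # -monochrome reduces each page to black and white;
    # -compress Group4 then stores the 1-bit data with fax compression.
    convert ./*.png -monochrome -compress Group4 "$out"
}

# Usage: bilevel_pdf kapittel1-bw.pdf
```

Grayscale or anti-aliased text will lose its smooth edges this way, so it is worth checking a single page before converting the whole set.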

nwellnhof
  • Thanks but the `-compress Zip` had seemingly no effect on the output, resulting in the exact same size/quality as without it. Unfortunately text is not black and white. What seems to work best though is `-scale` and not going below 25%. – Jens Bodal Oct 12 '14 at 20:21
  • Hmm, I was also a bit puzzled by the statement that ImageMagick doesn't use image compression in PDFs by default. Maybe that has changed in a newer version. If you see a drastic change in quality after scaling down to a certain size, it's probably a bug. You could try to resize the images to intermediate files first and create the PDF in a second step. Also note that `-resize` generally produces better output than `-scale`. – nwellnhof Oct 12 '14 at 21:25
  • Could just be my eyesight, I don't know, but I found better quality with `-scale` vs `-resize`, could just be that they were near identical and I wanted to believe I found a better way to do it. It must be a bug, as if I were to just use IrfanView in Windows to reduce the size by 20% then print to PDF it worked just fine. For now I'm just going to scale to 25% of the original and deal with the large multiple PDF files. – Jens Bodal Oct 12 '14 at 21:48
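
The two-step approach suggested in the comments above (resize to intermediate files first, then build the PDF in a second step) might look roughly like this. The function name `resize_then_pdf` and the `resized` scratch directory are hypothetical; this is a sketch assuming ImageMagick's `convert` is on the PATH, not a confirmed fix for the quality loss:

```shell
# Hypothetical helper: resize PNGs into a scratch directory, then build the PDF.
# Usage: resize_then_pdf 50% kapittel1.pdf
resize_then_pdf() {
    pct=$1
    out=$2
    mkdir -p resized || return 1
    for f in ./*.png; do
        [ -e "$f" ] || continue
        # Step 1: write each downsized page as its own PNG (still lossless),
        # so the result can be inspected before any PDF is created.
        convert "$f" -resize "$pct" "resized/${f#./}" || return 1
    done
    # Step 2: assemble the intermediate files into a single PDF.
    convert resized/*.png -compress Zip "$out"
}
```

Keeping the intermediate PNGs makes it possible to tell whether any quality loss comes from the resize itself or from the PDF step.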

OK, I discovered that after following Shawn Patrick Rice's suggestion (Optimize Scanned PDF plus OCR with ClearScan), the difference in final PDF size between a -resize setting of 30% and 50% was fairly negligible. The primary goal of the resize is to bring the resulting PDF pages under 45" in height, as this is the threshold for Adobe's OCR. I found no benefit from converting each image individually to a PDF and then resizing, or from playing with the plethora of other settings in Adobe. The process below kept (as far as I can tell) all of the image quality and produced the smallest PDF at full quality.

My process was as follows:

convert *.png -resize 50% name.pdf
# resize amount depends on the original file dimensions; the goal is a document height < 45"
Adobe Acrobat => Document Processing => Optimize Scanned PDF (Edit => ClearScan output style) => OK

The PDF produced by ImageMagick is still quite large, but its size goes down considerably after optimizing in Adobe (90MB => 4MB). If I resized to 30% first, there was noticeable image quality loss, and the additional saving after optimizing was only around 800KB for the above file.
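
The resize percentage needed to stay under the 45" limit can be estimated from the pixel height of a page. The sketch below assumes the page height in inches is the pixel height divided by the image density, and that the density is 72 dpi unless the image says otherwise; both the function name `max_resize_percent` and the 72 dpi default are assumptions for illustration, so check the actual density of your files:

```shell
# Hypothetical helper: largest whole-number -resize percentage that keeps
# the page height strictly under a limit in inches.
# Arguments: pixel height, density in dpi (assumed 72), limit in inches (45).
max_resize_percent() {
    height_px=$1
    dpi=${2:-72}
    limit_in=${3:-45}
    # Largest pct such that height_px * pct / 100 < limit_in * dpi
    pct=$(( (limit_in * dpi * 100 - 1) / height_px ))
    # Never suggest enlarging the image
    if [ "$pct" -gt 100 ]; then
        pct=100
    fi
    echo "$pct"
}

max_resize_percent 10000 72 45   # prints 32
```

For a 10000-pixel-tall page at 72 dpi this gives 32%, i.e. 3200 pixels or about 44.4". ImageMagick rounds pixel dimensions when resizing, so treat the result as an estimate rather than an exact bound.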

Jens Bodal