
I have a multi-page PDF with photographed book pages. I want to remove gradients from every page to prepare for optical character recognition.

This command works fine on a PNG of a single page:

convert page.png \( +clone -blur 0x64 \) -compose minus -composite -channel RGB -negate page_deblurred.png

However, as soon as I try this on a multi-page PDF by using this command...

convert full.pdf \( +clone -blur 0x64 \) -compose minus -composite -channel RGB -negate full_deblurred.pdf

...I get a single-page PDF with inverted colors, overlaid with text from several pages.

How do I tell ImageMagick to process every page like it does with the PNG and return a multi-page PDF to me?

303

3 Answers


It seems unlikely you'd want to pass a PDF to OCR, since Tesseract et al prefer PNG or NetPBM PPM files, so you might as well split your big PDF into individual PNG (or other) files:

convert full.pdf page-%03d.png

You can now remove gradients on individual pages, one at a time, and pass them to OCR. Or you can use GNU Parallel to do them in parallel - please say if this is an option and I will write it up for you.
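
For reference, a minimal sketch of that parallel variant (assuming GNU Parallel is installed; the page-*.png names come from the split command above, {} is Parallel's placeholder for the input file and {.} for the input without its extension):

parallel "convert {} \( +clone -blur 0x64 \) -compose minus -composite -channel RGB -negate {.}_deblurred.png" ::: page-*.png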

Mark Setchell
  • Thank you for your answer! Actually I'd prefer to use pdfsandwich, which indeed takes PDFs as input and has some built-in options that I'd like to use. – 303 Mar 14 '20 at 21:30
  • 1
    Ok, I've not heard of pdfsandwich, but I guess I would still split the PDF into separate pages as above, remove the gradient from pages individually either in a `for` loop or in parallel, then recombine into a multi-page PDF for your preferred tool. Let me know if you like that approach and want a hand with any aspects. – Mark Setchell Mar 14 '20 at 21:40
  • If ImageMagick can't do the job on its own, I'd happily accept any answer that produces the output I want. I have trouble seeing the benefit of GNU Parallel in this case, though, since ImageMagick already uses all of my CPU cores when processing just a single file. – 303 Mar 14 '20 at 22:08

As ImageMagick does not seem to be capable of doing this in one shot, I put together a script based on the suggestion Mark Setchell made in a comment on his answer.

#!/usr/bin/bash

set -e

# Work in a temporary directory for the intermediate page images
tmpdir=$(mktemp -d)

echo "Splitting PDF into single pages"
convert -density 288 "$1" "${tmpdir}/page-%03d.png"

# Remove the background gradient from each page individually
for f in "$tmpdir"/page-*.png
do
    echo "Processing ${f##*/}"
    convert "$f" \( +clone -blur 0x64 \) -compose minus -composite -channel RGB -negate "${f}_gradient_removed.png"
done

# Recombine the processed pages into <original name>_gradient_removed.pdf
pdf_file_name_without_suffix="${1%.pdf}"
echo "Reassembling PDF"
convert "$tmpdir"/*_gradient_removed.png -quality 100 "${pdf_file_name_without_suffix}_gradient_removed.pdf"

rm -rf "${tmpdir}"

It works fine with my material. Your mileage may vary.
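
For reference, a possible invocation, assuming the script is saved as remove_gradient.sh (both file names below are illustrative):

chmod +x remove_gradient.sh
./remove_gradient.sh book_scan.pdf   # writes book_scan_gradient_removed.pdf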

303

This should do what you want in ImageMagick in one command line. You have to use -layers composite and separate your PDF from your blur processing with null:. This is the same process as merging animated GIFs.

convert -density 288 image.pdf -write mpr:img null: \( mpr:img -blur 0x64 \) -compose minus -layers composite -channel RGB -negate -resize 25% image_deblurred.pdf
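
The same pipeline, split across lines (functionally identical to the one-liner above):

convert -density 288 image.pdf \
    -write mpr:img \
    null: \
    \( mpr:img -blur 0x64 \) \
    -compose minus -layers composite \
    -channel RGB -negate \
    -resize 25% \
    image_deblurred.pdf

-write mpr:img keeps a copy of every page in a named in-memory register, null: marks the end of the first image list, and -layers composite then pairs each original page with its blurred clone and applies the minus composite per pair, so the whole multi-page document is processed page by page rather than flattened.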


fmw42
  • Thank you for your answer! It works, but the output quality is too low for OCR. I took the liberty of submitting an edit adding the flag `-density 288` after `convert`. This massively increases the computation time, but the results are much better. One can use this flag to control the runtime-to-quality trade-off. – 303 Mar 19 '20 at 15:54
  • 1
    I have further edited the answer so that the output size is not changed by the -density 288. Adding -resize 25% compensates for the -density 288 size increase. It will still take as long to process as without the -resize. – fmw42 Mar 19 '20 at 16:44