5

I need to reduce the file size of a color scan.

So far, I think the following steps should be taken:

  • selective blur (or similar) to reduce noise
  • scale to ~120dpi
  • reduce colors

So far we use convert (ImageMagick) and the netpbm tools.
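A rough sketch of what that pipeline currently looks like with convert (the flag values are placeholders we are still experimenting with, not tuned settings):

# denoise, resample to roughly 120dpi, and reduce colors (placeholder values)
convert scan.png -selective-blur 0x2+10% -units PixelsPerInch -resample 120 -dither none -colors 64 reduced.png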

The scans are invoices, not photos.

Any hints appreciated.

Update

Example: example.png

Bounty

The smallest, still easily readable reduced version of example.png, produced with a reproducible solution, gets the bounty. The solution needs to use open source software only.

The file format is not important, as long as you can convert it to PNG again. Processing time is not important. I can optimize later.

Update

I got very good results for black-and-white output (thank you). Reducing the colors to about 16 or 32 would be interesting.
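For reference, the kind of color reduction I have in mind would be something like this with ImageMagick (illustrative values, not a tuned solution):

# quantize to a 16-color palette and write an 8-bit palette PNG
convert example.png -dither FloydSteinberg -colors 16 PNG8:example_16c.png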

Jørgen R
guettli
  • Related question on github issues from vips library: https://github.com/jcupitt/libvips/issues/9 – guettli Jan 27 '12 at 13:07
  • What are your requirements for the final file? What needs to be able to read them? (e.g. must they be PNG, or can they be another format? If another format, can they be a proprietary format or an often unused format?) Also, what kind of scans? Are they always documents with relatively few colors? Are they ever full photos? Additionally, how much time can you spend processing them? Are they always about the dimensions of your example? – Kaganar Feb 01 '12 at 14:35
  • I updated the example. Let's get example.png as small as possible. More colors, other dimension or output format are not important at the moment. – guettli Feb 02 '12 at 09:45
  • Noise is usually reduced by averaging/blurring/clamping/etc., which is more or less the opposite of sharpening. – Alexey Frunze Feb 03 '12 at 19:27
  • If you want the _smallest_ readable file, then that's a text document! Maybe too brutal, but it wasn't mentioned in your criteria which aspects had to be kept. – Stephen Quan Feb 04 '12 at 05:47
  • @Alex, yes, you are right. To reduce noise, selective blur is better than sharpening. – guettli Feb 05 '12 at 15:41
  • Will the form always be the same one, or will there be different ones? If the form is always the same, one possible approach would be to subtract the empty form from the filled-out one and just store the difference (which will have a lot of white space and thus compress nicely). The difficult part in that case is getting the empty form to register perfectly with the scanned one. – Quasimondo Feb 07 '12 at 09:30
  • @Quasimondo: No, the forms will be different. And even if they were the same, I think it would be quite difficult to just store the difference. The scanner will read the image differently, even if you scan the same document twice. But thank you for this idea. – guettli Feb 07 '12 at 21:30

3 Answers

4

This is a rather open-ended question, since there is still room to trade image quality against image size... after all, making it black and white and compressing it with CCITT T.6 black-and-white (fax-style) compression is going to beat the pants off most, if not all, color-capable compression algorithms.

If you're willing to go black and white (not grayscale), do that! It makes documents very small.
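For example (the threshold value here is illustrative, not part of the results measured below), ImageMagick can write CCITT Group 4 compressed TIFF directly:

# threshold to pure black and white and compress with CCITT Group 4 (T.6)
convert example.png -colorspace Gray -threshold 60% -compress Group4 example_bw.tif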

Otherwise I recommend a series of minor image transformations plus Adaptive Prediction Trees (see here). The APT software package is open source or public domain and very easy to compile and use. Its advantages are that it performs well on a wide variety of image types, especially text, and it lets you trade image size against image quality without losing readability. (I found myself squishing an example_1000-sized color version down to 48KB on the threshold of readability, and to 64KB with obvious artifacts but easy readability.)

I combined APT with imagemagick tweakery:

convert example.png -resize 50% -selective-blur 0x4+10% -brightness-contrast -5x30 -resize 80% example.ppm
./capt example.ppm example.apt 20  # The 20 means quality in the range [0,100]

And to reverse the process

./dapt example.apt out_example.ppm
convert out_example.ppm out_example.png

To explain the imagemagick settings:

  • -resize 50%: Make it half the size to speed up processing. This also hides some print and scan artifacts.
  • -selective-blur 0x4+10%: Sharpening actually creates more noise. What you actually want is a selective blur (like in Photoshop) that blurs only where there is no "edge".
  • -brightness-contrast -5x30: Here we increase the contrast a good bit to clip the off-coloration caused by the page outline (which would otherwise make the data less compressible). We also darken slightly to make the blacks blacker.
  • -resize 80%: Finally, we resize to a little bigger than your example_1000 image size. (Close enough.) This also reduces the number of obvious artifacts, since they're somewhat hidden when pixels are merged together.

At this point you're going to have a fine-looking image for this example -- nice, smooth colors and crisp text. Then we compress. A quality value of 20 is a pretty low setting, and the result isn't as spiffy-looking anymore, but the document is very legible. Even at a quality value of 0 it's still mostly legible.

Again, using APT isn't necessarily going to lead to the best results for this particular image, but it won't turn photographic-like content such as gradients into an entirely unrecognizable mess, so it should cover you better across a wider range of document types, including unexpected ones.

Results: 88KB, 76KB, 64KB, 48KB

Processed image before compression

Kaganar
  • The typical obvious optimization when dealing with ImageMagick: don't use ImageMagick! It's slow. If you're brave, programming what we're doing with ImageMagick isn't too hard, except maybe for the resize. (Maybe the GD graphics library will make things simpler?) – Kaganar Feb 02 '12 at 17:01
  • Thank you for your answer. I gave you the bounty. – guettli Feb 07 '12 at 21:33
4

If you truly don't care about the number of colors, we may as well go black-and-white and use a bilevel coder. I ended up using the DjVu format because it compares well to JBIG2 and has open source encoders. In this case I used the didjvu encoder because it achieved the best results. (On Ubuntu you can apt-get install didjvu; perhaps on other distributions as well.)

The magic I ended up with looks like this to encode:

convert example.png -resize 50% -selective-blur 0x4+10% -normalize -brightness-contrast -20x100 -dither none -type bilevel example_djvu.pgm
didjvu encode -o example.djvu example_djvu.pgm --lossless

Note that this is actually a better color blur than 0x2+10% at full resolution -- it ends up making the image about as nice as it can get before it's converted to a bilevel image.

Decoding works as follows:

convert example.djvu out_example.png

Even at the larger resolution (which is much easier to read), the size weighs in at 24KB. When reduced to the same size, it's still 24KB! Lastly, at only a 75% reduction of the original image and with a 0x5+10% blur, it weighs in at 32KB.

See here for the visual results: http://img29.imageshack.us/img29/687/exampledjvu.png

Kaganar
2

If you already have it doing the right thing with the ImageMagick "convert" utility, then it might be a good idea to look at the ImageMagick libraries first.

A quick look at my Ubuntu package lists shows bindings for Perl, Python, Ruby, C++, and Java.

Vorsprung