5

I need to reduce the file size of a color scan.

So far, I think the following steps should be taken:

  • selective blur (or similar) to reduce noise
  • scale to ~120dpi
  • reduce colors

So far we use convert (ImageMagick) and the netpbm tools.
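A rough sketch of what that pipeline currently looks like with convert (the flag values are placeholders we are still experimenting with, not tuned settings):

# denoise, resample to roughly 120dpi, and reduce colors (placeholder values)
convert scan.png -selective-blur 0x2+10% -units PixelsPerInch -resample 120 -dither none -colors 64 reduced.png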

The scans are invoices, not photos.

Any hints appreciated.

Update

Example: example.png

Bounty

The smallest, still easily readable reduced version of example.png, produced with a reproducible solution, gets the bounty. The solution needs to use open source software only.

The file format is not important, as long as you can convert it to PNG again. Processing time is not important. I can optimize later.

Update

I got very good results for black-and-white output (thank you). Reducing the colors to about 16 or 32 would be interesting.
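For reference, the kind of color reduction I have in mind would be something like this with ImageMagick (illustrative values, not a tuned solution):

# quantize to a 16-color palette and write an 8-bit palette PNG
convert example.png -dither FloydSteinberg -colors 16 PNG8:example_16c.png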

Jørgen R
guettli
  • Related question on github issues from vips library: https://github.com/jcupitt/libvips/issues/9 – guettli Jan 27 '12 at 13:07
  • What are your requirements for the final file? What needs to be able to read them? (e.g. must they be PNG, or can they be another format? If another format, can they be a proprietary format or an often unused format?) Also, what kind of scans? Are they always documents with relatively few colors? Are they ever full photos? Additionally, how much time can you spend processing them? Are they always about the dimensions of your example? – Kaganar Feb 01 '12 at 14:35
  • I updated the example. Let's get example.png as small as possible. More colors, other dimension or output format are not important at the moment. – guettli Feb 02 '12 at 09:45
  • Noise is usually reduced by averaging/blurring/clamping/etc., which is more or less the opposite of sharpening. – Alexey Frunze Feb 03 '12 at 19:27
  • If you want the _smallest_ readable file, then that's a text document! Maybe too brutal, but it wasn't mentioned in your criteria which aspects had to be kept. – Stephen Quan Feb 04 '12 at 05:47
  • @Alex, yes, you are right. To reduce noise, selective blur is better than sharpening. – guettli Feb 05 '12 at 15:41
  • Will the form always be the same one, or will there be different ones? If the form is always the same, one possible approach would be to subtract the empty form from the filled-out one and just store the difference (which will have a lot of white space and thus compress nicely). The difficult part in that case is getting the empty form to register perfectly with the scanned one. – Quasimondo Feb 07 '12 at 09:30
  • @Quasimondo: No, the forms will be different. And even if they were the same, I think it would be quite difficult to just store the difference. The scanner will read the image differently, even if you scan the same document twice. But thank you for this idea. – guettli Feb 07 '12 at 21:30

3 Answers

4

This is a rather open-ended question, since there is still room to trade image quality against image size... after all, making it black and white and compressing it with CCITT T.6 black-and-white (fax-style) compression is going to beat the pants off most, if not all, color-capable compression algorithms.

If you're willing to go black and white (not grayscale), do that! It makes documents very small.
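For example (the threshold value here is illustrative, not part of the results measured below), ImageMagick can write CCITT Group 4 compressed TIFF directly:

# threshold to pure black and white and compress with CCITT Group 4 (T.6)
convert example.png -colorspace Gray -threshold 60% -compress Group4 example_bw.tif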

Otherwise I recommend a series of minor image transformations plus Adaptive Prediction Trees (see here). The APT software package is open source or public domain and very easy to compile and use. Its advantages are that it performs well on a wide variety of image types, especially text, and it lets you trade image size against image quality without losing readability. (I found myself squishing an example_1000-sized color version down to 48KB on the threshold of readability, and to 64KB with obvious artifacts but easy readability.)

I combined APT with imagemagick tweakery:

convert example.png -resize 50% -selective-blur 0x4+10% -brightness-contrast -5x30 -resize 80% example.ppm
./capt example.ppm example.apt 20  # The 20 means quality in the range [0,100]

And to reverse the process

./dapt example.apt out_example.ppm
convert out_example.ppm out_example.png

To explain the imagemagick settings:

  • -resize 50%: Make it half the size to speed up processing. This also hides some print and scan artifacts.
  • -selective-blur 0x4+10%: Sharpening actually creates more noise. What you actually want is a selective blur (like in Photoshop) that blurs only where there is no "edge".
  • -brightness-contrast -5x30: Here we increase the contrast a good bit to clip the off-coloration caused by the page outline (which would otherwise make the data less compressible). We also darken slightly to make the blacks blacker.
  • -resize 80%: Finally, we resize to a little bigger than your example_1000 image size. (Close enough.) This also reduces the number of obvious artifacts, since they're somewhat hidden when pixels are merged together.

At this point you're going to have a fine-looking image for this example -- nice, smooth colors and crisp text. Then we compress. A quality value of 20 is a pretty low setting, and the result isn't as spiffy-looking anymore, but the document is very legible. Even at a quality value of 0 it's still mostly legible.

Again, using APT isn't necessarily going to lead to the best results for this particular image, but it won't turn photographic-like content such as gradients into an entirely unrecognizable mess, so it should cover you better across a wider range of document types, including unexpected ones.

Results: 88KB, 76KB, 64KB, 48KB

Processed image before compression

Kaganar
  • The typical obvious optimization when dealing with ImageMagick: don't use ImageMagick! It's slow. If you're brave, programming what we're doing with ImageMagick isn't too hard, except maybe for the resize. (Maybe the GD graphics library will make things simpler?) – Kaganar Feb 02 '12 at 17:01
  • Thank you for your answer. I gave you the bounty. – guettli Feb 07 '12 at 21:33
4

If you truly don't care about the number of colors, we may as well go black-and-white and use a bilevel coder. I ended up using the DjVu format because it compares well to JBIG2 and has open source encoders. In this case I used the didjvu encoder because it achieved the best results. (On Ubuntu you can apt-get install didjvu; perhaps on other distributions as well.)

The magic I ended up with looks like this to encode:

convert example.png -resize 50% -selective-blur 0x4+10% -normalize -brightness-contrast -20x100 -dither none -type bilevel example_djvu.pgm
didjvu encode -o example.djvu example_djvu.pgm --lossless

Note that this is actually a better color blur than 0x2+10% at full resolution -- it ends up making the image about as nice as it can get before it's converted to a bilevel image.

Decoding works as follows:

convert example.djvu out_example.png

Even at the larger resolution (which is much easier to read), the size weighs in at 24KB. When reduced to the same size, it's still 24KB! Lastly, at only a 75% reduction of the original image and with a 0x5+10% blur, it weighs in at 32KB.

See here for the visual results: http://img29.imageshack.us/img29/687/exampledjvu.png

Kaganar
2

If you already have it doing the right thing with the ImageMagick "convert" utility, then it might be a good idea to look at the ImageMagick libraries first.

A quick look at my Ubuntu package lists shows bindings for Perl, Python, Ruby, C++, and Java.

Vorsprung