File under: "Unexpected Efficiency Dept."
The first 90 million numbers take up about 761MB, as output by:
seq 90000000
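A quick way to confirm that figure without writing the data to disk is to count the bytes:
seq 90000000 | wc -c
That prints 798888897 bytes, which is the ~761MB quoted above.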
According to man parallel, it can speed up gzip's compression of big files by chopping the input into chunks and using different CPUs to compress them. So even though gzip is single-threaded, this technique makes it multi-threaded:
seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
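As I read the man page: --pipe splits stdin into blocks (1MB by default), --recend '' makes the splits fall at exact byte boundaries rather than at record ends, and -k keeps the compressed chunks in input order, so the concatenated gzip streams decompress back to the original. The chunk size can be tuned with --block; for instance (10M is just an illustrative value, as is the output name):
seq 90000000 | parallel --pipe --block 10M --recend '' -k gzip -9 >bigfile_10m.gz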
Took 46 seconds on an Intel Core i3-2330M (4 threads) @ 2.2GHz.
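For anyone reproducing the measurement, the shell's time keyword does the job (a sketch, not necessarily how the numbers here were taken):
time ( seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz )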
Pipe that to plain old gzip:
seq 90000000 | gzip -9 > bigfile2.gz
Took 80 seconds on the same CPU. Now the surprise:
ls -log bigfile*.gz
Output:
-rw-rw-r-- 1 200016306 Jul 3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul 3 17:30 bigfile2.gz
365K larger? That didn't look right. First I checked with zdiff whether the files had the same contents (the exact check is shown below) -- yes, they do. I'd have supposed any compressor would do better on a continuous data stream than on a chunked one. Why isn't bigfile2.gz smaller than bigfile.gz?
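For completeness, the equality check was of this form; zdiff compares the decompressed contents of two compressed files:
zdiff bigfile.gz bigfile2.gz
No output and a zero exit status means the decompressed streams are identical.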