File under: "Unexpected Efficiency Dept."
The first 90 million numbers take up about 761MB, as output by:
seq 90000000
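A quick way to confirm that figure without writing the data to disk is to count the bytes:
seq 90000000 | wc -c
That prints 798888897 bytes, which is the ~761MB quoted above.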
According to man parallel, it can speed up gzip's compression of big files by chopping the input into chunks and using different CPUs to compress them. So even though gzip is single-threaded, this technique makes it multi-threaded:
seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
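As I read the man page: --pipe splits stdin into blocks (1MB by default), --recend '' makes the splits fall at exact byte boundaries rather than at record ends, and -k keeps the compressed chunks in input order, so the concatenated gzip streams decompress back to the original. The chunk size can be tuned with --block; for instance (10M is just an illustrative value, as is the output name):
seq 90000000 | parallel --pipe --block 10M --recend '' -k gzip -9 >bigfile_10m.gz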
Took 46 seconds on an Intel Core i3-2330M (4 threads) @ 2.2GHz.
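For anyone reproducing the measurement, the shell's time keyword does the job (a sketch, not necessarily how the numbers here were taken):
time ( seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz )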
Pipe that to plain old gzip:
seq 90000000 | gzip -9 > bigfile2.gz
Took 80 seconds on the same CPU. Now the surprise:
ls -log bigfile*.gz
Output:
-rw-rw-r-- 1 200016306 Jul 3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul 3 17:30 bigfile2.gz
365K larger? That didn't look right. First I checked with zdiff whether the files had the same contents (the exact check is shown below) -- yes, they do. I'd have supposed any compressor would do better on a continuous data stream than on a chunked one. Why isn't bigfile2.gz smaller than bigfile.gz?
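For completeness, the equality check was of this form; zdiff compares the decompressed contents of two compressed files:
zdiff bigfile.gz bigfile2.gz
No output and a zero exit status means the decompressed streams are identical.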